Abstract


Nonparametric methods play a central role in modern empirical work. While they 
provide inference procedures that are more robust to parametric misspecification bias, 
they may be quite sensitive to tuning parameter choices. We study the effects of bias 
correction on confidence interval coverage in the context of kernel density and local 
polynomial regression estimation, and prove that bias correction can be preferred to 
undersmoothing for minimizing coverage error and increasing robustness to tuning parameter
choice. This is achieved using a novel, yet simple, Studentization, which leads to
a new way of constructing kernel-based bias-corrected confidence intervals. In addition, 
for practical cases, we derive coverage error optimal bandwidths and discuss easy-to- 
implement bandwidth selectors. For interior points, we show that the MSE-optimal 
bandwidth for the original point estimator (before bias correction) delivers the fastest 
coverage error decay rate after bias correction when second-order (equivalent) kernels 
are employed, but is otherwise suboptimal because it is too “large”. Finally, for odd- 
degree local polynomial regression, we show that, as with point estimation, coverage 
error adapts to boundary points automatically when appropriate Studentization is used; 
however, the MSE-optimal bandwidth for the original point estimator is suboptimal. 

All the results are established using valid Edgeworth expansions and illustrated with 
simulated data. Our findings have important consequences for empirical work as they 
indicate that bias-corrected confidence intervals, coupled with appropriate standard errors,
have smaller coverage error and are less sensitive to tuning parameter choices in
practically relevant cases where additional smoothness is available. 

Keywords: Edgeworth expansion, coverage error, kernel methods, local polynomial regression.



1 Introduction 


Nonparametric methods are widely employed in empirical work, as they provide point estimates
and inference procedures that are more robust to parametric misspecification bias.
Kernel-based methods are commonly used to estimate densities, conditional expectations, 
and related functions nonparametrically in a wide variety of settings. However, these methods
require specifying a bandwidth and their performance in applications crucially relies on
how this tuning parameter is chosen. In particular, valid inference requires the delicate
balancing act of selecting a bandwidth small enough to remove smoothing bias, yet large enough
to ensure adequate precision. Tipping the scale in either direction can greatly skew results. 
This paper studies kernel density and local polynomial regression estimation and inference 
based on the popular Wald-type statistics and demonstrates (via higher-order expansions) 
that by coupling explicit bias correction with a novel, yet simple, Studentization, inference 
can be made substantially more robust to bandwidth choice, greatly easing implementability. 

Perhaps the most common bandwidth selection approach is to minimize the asymptotic 
mean-square error (MSE) of the point estimator, and then use this bandwidth choice even 
when the goal is inference. So difficult is bandwidth selection perceived to be, that despite 
the fact that the MSE-optimal bandwidth leads to invalid confidence intervals, even
asymptotically, this method is still advocated, and is the default in most popular software. Indeed,
Hall and Kang (2001, p. 1446) write: “there is a growing belief that the most appropriate
approach to constructing confidence regions is to estimate [the density] in a way that is
optimal for pointwise accuracy.... [I]t has been argued that such an approach has advantages of
clarity, simplicity and easy interpretation.” 

The underlying issue, as formalized below, is that bias must be removed for valid inference, 
and (in particular) the MSE-optimal bandwidth is “too large”, leaving a bias that is still first 
order. Two main methods have been proposed to address this: undersmoothing and explicit 
bias correction. We seek to compare these two, and offer concrete ways to better implement 
the latter. Undersmoothing amounts to choosing a bandwidth smaller than would be optimal
for point estimation, and then arguing that the bias is smaller than the variability of the
estimator asymptotically, leading to valid distributional approximations and confidence
intervals. In practice this method often involves simply shrinking the MSE-optimal bandwidth
by an ad hoc amount. The second approach is to bias correct the estimator with the explicit
goal of removing the bias that caused the invalidity of the inference procedure in the first place.

It has long been believed that undersmoothing is preferable for two reasons. First,
theoretical studies showed inferior asymptotic coverage properties of bias-corrected confidence
intervals. The pivotal work was done by Hall (1992b), and has been relied upon since.
Second, implementation of bias correction is perceived as more complex because a second (usually
different) bandwidth is required, deterring practitioners. However, we show theoretically that
bias correction is always as good as undersmoothing, and better in many practically relevant
cases, if the new standard errors that we derive are used. Further, our findings have important
implications for empirical work because the resulting confidence intervals are more robust
to bandwidth choice, including to the bandwidth used for bias estimation. Indeed, the two 
bandwidths can be set equal, a simple and automatic choice that performs well in practice 
and is even optimal in certain objective senses. 

Our proposed robust bias correction method delivers valid confidence intervals (and related 
inference procedures) even when using the MSE-optimal bandwidth for the original point 
estimator, the most popular approach in practice. Moreover, we show that at interior points, 
when using second-order kernels or local linear regressions, the coverage error of such intervals 
vanishes at the best possible rate. (Throughout, the notion of “optimal” or “best” rate is 
defined as the fastest achievable coverage decay for a fixed kernel order or polynomial degree; 
and is also different from optimizing point estimation.) When higher-order kernels are used, 
or boundary points are considered, we find that the corresponding MSE-optimal bandwidth 
leads to asymptotically valid intervals, but with suboptimal coverage error rates, and must be 
shrunk (sometimes considerably) for better inference. 

Heuristically, employing the MSE-optimal bandwidth for the original point estimator, prior 
to bias correction, is like undersmoothing the bias-corrected point estimator, though the latter
estimator employs a possibly random, n -varying kernel, and requires a different Studentization 
scheme. It follows that the conventional MSE-optimal bandwidth commonly used in practice 
need not be optimal, even after robust bias correction, when the goal is inference. Thus, 
we present new coverage error optimal bandwidths and a fully data-driven direct plug-in 
implementation thereof, for use in applications. In addition, we study the important related 
issue of asymptotic length of the new confidence intervals. 

Our comparisons of undersmoothing and bias correction are based on Edgeworth expansions
for density estimation and local polynomial regression, allowing for different levels of
smoothness of the unknown functions. We prove that explicit bias correction, coupled with
our proposed standard errors, yields confidence intervals with coverage that is as accurate as, or
more accurate than, undersmoothing (or, equivalently, yields dual hypothesis tests with lower error
in rejection probability). Loosely speaking, this improvement is possible because explicit bias 
correction can remove more bias than undersmoothing, while our proposed standard errors 
capture not only the variability of the original estimator but also the additional variability from 
bias correction. To be more specific, our robust bias correction approach yields higher-order 
refinements whenever additional smoothness is available, and is asymptotically equivalent to 
the best undersmoothing procedure when no additional smoothness is available. 

Our findings contrast with well established recommendations: Hall (1992b) used Edgeworth
expansions to show that undersmoothing produces more accurate intervals than explicit
bias correction in the density case and Neumann (1997) repeated this finding for kernel
regression. The key distinction is that their expansions, while imposing the same levels of 
smoothness as we do, crucially relied on the assumption that the bias correction was first- 
order negligible, essentially forcing bias correction to remove less bias than undersmoothing. 
In contrast, we allow the bias estimator to potentially have a first order impact, an
alternative asymptotic experiment designed to more closely mimic the finite-sample behavior of
bias correction. Therefore, our results formally show that whenever additional smoothness is
available to characterize leading bias terms, as is usually the case in practice when
MSE-optimal bandwidths are employed, our robust bias correction approach yields higher-order
improvements relative to standard undersmoothing.

Our standard error formulas are based on fixed-n calculations, as opposed to asymptotics, 
which also turns out to be important. We show that using asymptotic variance formulas can 
introduce further errors in coverage probability, with particularly negative consequences at 
boundary points. This turns out to be at the heart of the “quite unexpected” conclusion 
found by Chen and Qin (2002, Abstract) that local polynomial based confidence intervals 
are not boundary-adaptive in coverage error: we prove that this is not the case with proper 
Studentization. Thus, as a by-product of our main theoretical work, we establish higher-order 
boundary carpentry of local polynomial based confidence intervals that use a fixed-n standard 
error formula, a result that is of independent (but related) interest. 

This paper is connected to the well-established literature on nonparametric smoothing, see 
Wand and Jones (1995), Fan and Gijbels (1996), Horowitz (2009), and Ruppert et al. (2009) for 
reviews. For more recent work on bias and related issues in nonparametric inference, see Hall 
and Horowitz (2013), Calonico et al. (2014), Hansen (2015), Armstrong and Kolesar (2015), 
Schennach (2015), and references therein. We also contribute to the literature on Edgeworth 
expansions, which have been used both in parametric and, less frequently, nonparametric 
contexts; see, e.g., Bhattacharya and Rao (1976) and Hall (1992a). Fixed-n versus asymptotic- 
based Studentization has also captured some recent interest in other contexts, e.g., Mykland 
and Zhang (2015). Finally, see Calonico et al. (2016) for uniformly valid Edgeworth expansions 
and optimal inference in the context of regression discontinuity designs. 

The paper proceeds as follows. Section 2 studies density estimation at interior points and 
states the main results on error in coverage probability and its relationship to bias reduction 
and underlying smoothness, as well as discussing bandwidth choice and interval length. Section 
3 then studies local polynomial estimation, at interior and boundary points. Practical guidance 
is explicitly discussed in Sections 2.4 and 3.3, respectively; all methods are available in the R 
package nprobust on CRAN. Section 4 summarizes the results of a Monte Carlo study, and
Section 5 concludes. Some technical details, all proofs, and additional simulation evidence are 
collected in a lengthy online supplement. 

2 Density Estimation and Inference 

We first present our main ideas and conclusions for inference on the density at an interior 
point, as this requires relatively little notation. The data are assumed to obey the following. 

Assumption 2.1 (Data-generating process). $\{X_1, \ldots, X_n\}$ is a random sample with an
absolutely continuous distribution with Lebesgue density $f$. In a neighborhood of $x$, $f > 0$, $f$ is
$S$-times continuously differentiable with bounded derivatives $f^{(s)}$, $s = 1, 2, \ldots, S$, and
$f^{(S)}$ is Hölder continuous with exponent $\varsigma$.

The parameter of interest is $f(x)$ for a fixed scalar point $x$ in the interior of the support.
[In the supplemental appendix we discuss how our results extend naturally to multivariate $X_i$
and derivative estimation.] The classical kernel-based estimator of $f(x)$ is

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right), \tag{1}$$

for a kernel function $K$ that integrates to 1 and positive bandwidth $h \to 0$ as $n \to \infty$.
The choice of $h$ can be delicate, and our work is motivated in part by the standard empirical
practice of employing the MSE-optimal bandwidth choice for $\hat{f}(x)$ when conducting inference.
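
As a minimal illustration of the estimator in (1), the following Python sketch evaluates a kernel density estimate at a single point. The Epanechnikov kernel, the simulated standard Normal sample, and the bandwidth value 0.4 are assumptions made purely for illustration; they are not choices made in the paper.

```python
import random


def epanechnikov(u):
    """Second-order kernel supported on [-1, 1]; it integrates to 1."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kde(x, data, h, kernel=epanechnikov):
    """Kernel density estimate (1 / (n h)) * sum_i K((x - X_i) / h)."""
    n = len(data)
    return sum(kernel((x - xi) / h) for xi in data) / (n * h)


# Illustrative data only: a standard Normal sample, whose density at 0 is about 0.3989.
random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(5000)]
f_hat = kde(0.0, sample, h=0.4)
print(f_hat)
```

The bandwidth is passed explicitly because, as the text stresses, its choice drives both the bias and the variance of the estimate.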


In this vein, let us suppose for the moment that $K$ is a kernel of order $k$, where $k \le S$ so
that the MSE-optimal bandwidth can be characterized. The bias is then given by

$$\mathbb{E}[\hat{f}(x)] - f(x) = h^k f^{(k)}(x) \mu_{K,k} + o(h^k), \tag{2}$$

where $f^{(k)}(x) := \partial^k f(x) / \partial x^k$ and $\mu_{K,k} = \int u^k K(u)\,du / k!$. Computing the variance gives

$$(nh) \mathbb{V}[\hat{f}(x)] = \frac{1}{h} \left\{ \mathbb{E}\left[ K\!\left(\frac{x - X_i}{h}\right)^2 \right] - \mathbb{E}\left[ K\!\left(\frac{x - X_i}{h}\right) \right]^2 \right\}, \tag{3}$$

which is non-asymptotic: $n$ and $h$ are fixed in this calculation. Using other, first-order valid
approximations, e.g. $(nh) \mathbb{V}[\hat{f}(x)] \approx f(x) \int K(u)^2\,du$, will have finite sample consequences
that manifest as additional terms in the Edgeworth expansions. In fact, Section 3 shows that
using an asymptotic variance for local polynomial regression removes automatic coverage-error
boundary adaptivity.
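
To make the distinction between the fixed-n variance in (3) and its first-order approximation concrete, the sketch below computes the sample analogue of (3), replacing the two expectations with sample averages, alongside the plug-in version of $f(x) \int K(u)^2\,du$. The Epanechnikov kernel (for which $\int K(u)^2\,du = 0.6$), the simulated data, and the bandwidth are illustrative assumptions, not the paper's settings.

```python
import random


def epanechnikov(u):
    """Second-order kernel supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def sigma2_fixed_n(x, data, h):
    """Sample analogue of (3): (1/h) * (mean(K_i^2) - mean(K_i)^2)."""
    k_vals = [epanechnikov((x - xi) / h) for xi in data]
    n = len(k_vals)
    mean_k = sum(k_vals) / n
    mean_k2 = sum(v * v for v in k_vals) / n
    return (mean_k2 - mean_k * mean_k) / h


def sigma2_asymptotic(x, data, h):
    """First-order approximation f_hat(x) * int K(u)^2 du, with int K^2 = 0.6 here."""
    n = len(data)
    f_hat = sum(epanechnikov((x - xi) / h) for xi in data) / (n * h)
    return 0.6 * f_hat


# Illustrative data only.
random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(5000)]
fixed = sigma2_fixed_n(0.0, sample, h=0.4)
asym = sigma2_asymptotic(0.0, sample, h=0.4)
print(fixed, asym)
```

Since $\mathbb{E}[K]^2 / h = h\,(\mathbb{E}[K]/h)^2$, the fixed-n version retains a term of order $h f(x)^2$ that the first-order approximation drops, so the two values differ noticeably at moderate bandwidths.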

Together, the prior two displays are used to characterize the MSE-optimal bandwidth,
$h^*_{mse} \propto n^{-1/(1+2k)}$; however, using this bandwidth leaves a bias that is too large, relative
to the variance, to conduct valid inference for $f(x)$. To address this important practical
problem, researchers must either undersmooth the point estimator (i.e., construct $\hat{f}(x)$ with a
bandwidth smaller than $h^*_{mse}$) or bias-correct the point estimator (i.e., estimate and subtract
the leading bias after using $h^*_{mse}$, or a “larger” bandwidth, to construct $\hat{f}(x)$). Thus, the
question we seek to answer is this: if the bias is given by (2), is one better off estimating the
leading bias (explicit bias correction) or choosing $h$ small enough to render the bias negligible
(undersmoothing) when forming nonparametric confidence intervals?

To answer this question, and to motivate our new robust approach, we first detail the bias
correction and variance estimators. Explicit bias correction estimates the leading term of Eqn.
(2), denoted by $B_f$, using a kernel estimator of $f^{(k)}(x)$, defined as:

$$\hat{B}_f = h^k \hat{f}^{(k)}(x) \mu_{K,k}, \qquad \text{where} \qquad \hat{f}^{(k)}(x) = \frac{1}{n b^{1+k}} \sum_{i=1}^{n} L^{(k)}\!\left(\frac{x - X_i}{b}\right),$$

for a kernel $L(\cdot)$ of order $\ell$ and a bandwidth $b \to 0$ as $n \to \infty$. Importantly, $\hat{B}_f$ takes this
form for any $k$ and $S$, even if (2) fails; see Sections 2.2 and 2.3 for discussion. Conventional
Studentized statistics based on undersmoothing and explicit bias correction are, respectively,

$$T_{us}(x) = \frac{\sqrt{nh}\,(\hat{f}(x) - f(x))}{\hat{\sigma}_{us}} \qquad \text{and} \qquad T_{bc}(x) = \frac{\sqrt{nh}\,(\hat{f}(x) - \hat{B}_f - f(x))}{\hat{\sigma}_{us}},$$

where $\hat{\sigma}^2_{us} := \hat{\mathbb{V}}[\hat{f}(x)]$ is the natural estimator of the variance of $\hat{f}(x)$ which only replaces
the two expectations in (3) with sample averages, thus maintaining the nonasymptotic spirit.
These are the two statistics compared in the influential paper of Hall (1992b), under the same
assumptions imposed herein.

From the form of these statistics, two points are already clear. First, the numerator of $T_{us}$
relies on choosing $h$ vanishing fast enough so that the bias is asymptotically negligible after
scaling, whereas $T_{bc}$ allows for slower decay by virtue of the manual estimation of the leading
bias. Second, $T_{bc}$ requires that the variance of $h^k \hat{f}^{(k)}(x) \mu_{K,k}$ be first-order asymptotically
negligible: $\hat{\sigma}_{us}$ in the denominator only accounts for the variance of the main estimate, but
$\hat{f}^{(k)}(x)$, being a kernel-based estimator, naturally has a variance controlled by its bandwidth.
That is, even though $\hat{\sigma}^2_{us}$ is based on a fixed-$n$ calculation, the variance of the numerator
of $T_{bc}$ only coincides with the denominator asymptotically. Under this regime, Hall (1992b)
showed that the bias reduction achieved in $T_{bc}$ is too expensive in terms of noise and that
undersmoothing dominates explicit bias correction for coverage error.

We argue that there need not be such a “mismatch” between the numerator of the bias-
corrected statistic and the Studentization, and thus consider a third option corresponding to
the idea of capturing the finite sample variability of $\hat{f}^{(k)}(x)$ directly. To do so, note that we
may write, after setting $\rho = h/b$,

$$\hat{f}(x) - h^k \hat{f}^{(k)}(x) \mu_{K,k} = \frac{1}{nh} \sum_{i=1}^{n} M\!\left(\frac{x - X_i}{h}\right), \qquad M(u) = K(u) - \rho^{1+k} L^{(k)}(\rho u) \mu_{K,k}. \tag{4}$$

We then define the collective variance of the density estimate and the bias correction as
$\sigma^2_{rbc} = (nh) \mathbb{V}[\hat{f}(x) - \hat{B}_f]$, exactly as in Eqn. (3), but with $M(\cdot)$ in place of $K(\cdot)$, and its
estimator $\hat{\sigma}^2_{rbc}$ exactly as $\hat{\sigma}^2_{us}$. Therefore, our proposed robust bias corrected inference approach
is based on

$$T_{rbc} = \frac{\sqrt{nh}\,(\hat{f}(x) - h^k \hat{f}^{(k)}(x) \mu_{K,k} - f(x))}{\hat{\sigma}_{rbc}}.$$

That is, our proposed standard errors are based on a fixed-$n$ calculation that captures the
variability of both $\hat{f}(x)$ and $\hat{f}^{(k)}(x)$, and their covariance. As shown in Section 3, the case of
local polynomial regression is qualitatively analogous, but notationally more complicated.
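
The construction in (4) can be sketched in a few lines of Python. The code below is a hedged illustration rather than the authors' implementation (which the paper says is provided in the R package nprobust): it assumes $k = 2$ with an Epanechnikov kernel for $K$ (so $\mu_{K,2} = 0.1$), a triweight kernel for $L$ (chosen here only because its second derivative is continuous on the real line), a Normal critical value of about 1.96, and simulated data. The function names are hypothetical.

```python
import math
import random

MU_K2 = 0.1  # int u^2 K(u) du / 2! for the Epanechnikov kernel (assumed K, k = 2)


def K(u):
    """Epanechnikov kernel, order 2, support [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def L2(u):
    """Second derivative of the triweight kernel L(u) = (35/32)(1 - u^2)^3."""
    if abs(u) > 1.0:
        return 0.0
    return (105.0 / 16.0) * (1.0 - u * u) * (5.0 * u * u - 1.0)


def M(u, rho):
    """Kernel of Eqn. (4) with k = 2: K(u) - rho^3 * L''(rho * u) * mu_{K,2}."""
    return K(u) - rho ** 3 * L2(rho * u) * MU_K2


def rbc_interval(x, data, h, b, z=1.959964):
    """Bias-corrected point estimate with a fixed-n standard error built from M."""
    n = len(data)
    rho = h / b
    m_vals = [M((x - xi) / h, rho) for xi in data]
    mean_m = sum(m_vals) / n
    mean_m2 = sum(v * v for v in m_vals) / n
    centre = mean_m / h                          # f_hat(x) - B_hat_f
    sigma2_rbc = (mean_m2 - mean_m ** 2) / h     # sample analogue of (3) with M for K
    half = z * math.sqrt(sigma2_rbc / (n * h))
    return centre - half, centre + half


# Illustrative data only; h = b (rho = 1) is the simple choice mentioned in the text.
random.seed(2)
sample = [random.gauss(0.0, 1.0) for _ in range(5000)]
lo, hi = rbc_interval(0.0, sample, h=0.5, b=0.5)
print(lo, hi)
```

With these particular kernels, $M$ integrates to one and has a vanishing second moment, which is one way to see why it behaves like a higher-order kernel for bias while the standard error tracks the extra variability from the bias estimate.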

The quantity $\rho = h/b$ is key. If $\rho \to 0$, then the second term of $M$ is dominated by
the first, i.e. the bias correction is first-order negligible. In this case, $\sigma^2_{us}$ and $\sigma^2_{rbc}$ (and their
estimators) will be first-order, but not higher-order, equivalent. This is exactly the sense in
which traditional bias correction relies on an asymptotic variance, instead of a fixed-$n$ one,
and pays the price in coverage error: for any finite sample, standard bias corrected inference
can substantially impact the results. To more accurately capture finite sample behavior of
bias correction we allow $\rho$ to converge to any (nonnegative) finite limit, allowing (but not
requiring) the bias correction to be first-order important, unlike prior work. We show that
doing so yields more accurate confidence intervals (i.e., higher-order corrections).


2.1 Generic Higher Order Expansions of Coverage Error 

We first present generic Edgeworth expansions for all three procedures (undersmoothing, 
traditional bias correction and robust bias correction), which are agnostic regarding the level 
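of available smoothness (controlled by $S$ in Assumption 2.1). To be specific, we give higher-
order expansions of the error in coverage probability of the following $100(1 - \alpha)\%$ confidence
intervals based on Normal approximations for the statistics $T_{us}$, $T_{bc}$, and $T_{rbc}$:

$$I_{us} = \left[ \hat{f} - z_{1-\frac{\alpha}{2}} \frac{\hat{\sigma}_{us}}{\sqrt{nh}}, \;\; \hat{f} - z_{\frac{\alpha}{2}} \frac{\hat{\sigma}_{us}}{\sqrt{nh}} \right],$$

$$I_{bc} = \left[ \hat{f} - \hat{B}_f - z_{1-\frac{\alpha}{2}} \frac{\hat{\sigma}_{us}}{\sqrt{nh}}, \;\; \hat{f} - \hat{B}_f - z_{\frac{\alpha}{2}} \frac{\hat{\sigma}_{us}}{\sqrt{nh}} \right],$$

and

$$I_{rbc} = \left[ \hat{f} - \hat{B}_f - z_{1-\frac{\alpha}{2}} \frac{\hat{\sigma}_{rbc}}{\sqrt{nh}}, \;\; \hat{f} - \hat{B}_f - z_{\frac{\alpha}{2}} \frac{\hat{\sigma}_{rbc}}{\sqrt{nh}} \right], \tag{5}$$

where $z_\alpha$ is the upper $\alpha$-percentile of the Gaussian distribution. Here and in the sequel we omit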
the point of evaluation x for simplicity. Equivalently, our results can characterize the error in 
rejection probability of the corresponding hypothesis tests. In the following subsections, we 
give specific results under different smoothness assumptions and make direct comparisons of 
the methods. 

We require the following standard conditions on the kernels K and L. 

Assumption 2.2 (Kernels). The kernels $K$ and $L$ are bounded, even functions with support
$[-1, 1]$, and are of order $k \ge 2$ and $\ell \ge 2$, respectively, where $k$ and $\ell$ are even integers. That
is, $\mu_{K,0} = 1$, $\mu_{K,j} = 0$ for $1 \le j < k$, and $\mu_{K,k} \ne 0$ and bounded, and similarly for $\mu_{L,j}$ with $\ell$
in place of $k$. Further, $L$ is $k$-times continuously differentiable. For all integers $j$ and $m$ such
that $j + m = k - 1$, $f^{(j)}(x_0) L^{(m)}((x_0 - x)/b) = 0$ for $x_0$ in the boundary of the support.

The boundary conditions are needed for the derivative estimation inherent in bias correction,
even if $x$ is an interior point, and are satisfied if the support of $f$ is the whole real
line. Higher order results also require a standard $n$-varying Cramér’s condition, given in the
supplement to conserve space (see Section S.I.3). Altogether, our assumptions are identical
to those of Hall (1991, 1992b).

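To state the results some notation is required. First, let the (scaled) biases of the density
estimator and the bias-corrected estimator be $\eta_{us} = \sqrt{nh}\,(\mathbb{E}[\hat{f}] - f)$ and
$\eta_{bc} = \sqrt{nh}\,(\mathbb{E}[\hat{f} - \hat{B}_f] - f)$. Next, let $\phi(z)$ be the standard Normal density, and for any kernel $K$ define

$$q_1(K) = \vartheta_{K,2}^{-2} \vartheta_{K,4} (z^3 - 3z)/6 - \vartheta_{K,2}^{-3} \vartheta_{K,3}^2 \left[ 2z^3/3 + (z^5 - 10z^3 + 15z)/9 \right],$$

$$q_2(K) = -\vartheta_{K,2}^{-1} z, \qquad \text{and} \qquad q_3(K) = \vartheta_{K,2}^{-2} \vartheta_{K,3} (2z^3/3),$$

where $\vartheta_{K,j} = \int K(u)^j\,du$. All that is conceptually important is that these functions are known,
odd polynomials in $z$ with coefficients that depend only on the kernel, and not on the sample
or data generating process. Our main theoretical result for density estimation is the following.

Theorem 1. Let Assumptions 2.1, 2.2, and Cramér’s condition hold and $nh/\log(nh) \to \infty$.

(a) If $\eta_{us} \to 0$, then

$$\mathbb{P}[f \in I_{us}] = 1 - \alpha + \left\{ \frac{1}{nh} q_1(K) + \eta_{us}^2 q_2(K) + \frac{\eta_{us}}{\sqrt{nh}} q_3(K) \right\} \frac{\phi(z_{\alpha/2})}{f} \{1 + o(1)\}.$$

(b) If $\eta_{bc} \to 0$ and $\rho \to 0$, then

$$\mathbb{P}[f \in I_{bc}] = 1 - \alpha + \left\{ \frac{1}{nh} q_1(K) + \eta_{bc}^2 q_2(K) + \frac{\eta_{bc}}{\sqrt{nh}} q_3(K) \right\} \frac{\phi(z_{\alpha/2})}{f} \{1 + o(1)\}$$

$$\qquad\qquad + \rho^{1+k} (\Omega_1 + \rho^k \Omega_2) \phi(z_{\alpha/2}) z_{\alpha/2} \{1 + o(1)\},$$

for constants $\Omega_1$ and $\Omega_2$ given precisely in the supplement.

(c) If $\eta_{bc} \to 0$ and $\rho \to \bar{\rho} < \infty$, then

$$\mathbb{P}[f \in I_{rbc}] = 1 - \alpha + \left\{ \frac{1}{nh} q_1(M) + \eta_{bc}^2 q_2(M) + \frac{\eta_{bc}}{\sqrt{nh}} q_3(M) \right\} \frac{\phi(z_{\alpha/2})}{f} \{1 + o(1)\}.$$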

This result leaves the scaled biases $\eta_{us}$ and $\eta_{bc}$ generic, which is useful when considering
different levels of smoothness $S$, the choices of $k$ and $\ell$, and in comparing to local polynomial
results. In the next subsection, we make these quantities more precise and compare them,
paying particular attention to the role of the underlying smoothness assumed.

At present, the most visually obvious feature of this result is that all the error terms are of
the same form, except for the notable presence of $\rho^{1+k}(\Omega_1 + \rho^k \Omega_2)$ in part (b). These are the
leading terms of $\sigma^2_{rbc}/\sigma^2_{us} - 1$, consisting of the covariance of $\hat{f}$ and $\hat{B}_f$ (denoted by $\Omega_1$) and
the variance of $\hat{B}_f$ (denoted by $\Omega_2$), and are entirely due to the “mismatch” in the Studentization
scheme underlying $T_{bc}$. Hall (1992b) showed how these terms prevent bias correction from
performing as well as undersmoothing in terms of coverage. In essence, the potential for
improved bias properties does not translate into improved inference because the variance is not
well-controlled: in any finite sample, $\hat{B}_f$ would inject variability (i.e., $\rho = h/b > 0$ for each $n$)
and thus $\rho \to 0$ may not be a good approximation. Our new Studentization does not simply
remove the leading $\rho$ terms; the entire sequence is absent. As explained below, allowing for
$\bar{\rho} = \infty$ can not reduce bias, but will inflate variance; hence restricting to $\bar{\rho} < \infty$ capitalizes
fully on the improvements from bias correction.


2.2 Coverage Error and the Role of Smoothness 

Theorem 1 makes no explicit assumption about smoothness beyond the requirement that the 
scaled biases vanish asymptotically. The fact that the error terms in parts (a) and (c) of 
Theorem 1 take the same form implies that comparing coverage error amounts to comparing 
bias, for which the smoothness $S$ and the kernel orders $k$ and $\ell$ are crucial. We now make the
biases $\eta_{us}$ and $\eta_{bc}$ concrete and show how coverage is affected.

For $I_{us}$, two cases emerge: (a) enough derivatives exist to allow characterization of the
MSE-optimal bandwidth ($k \le S$); and (b) no such smoothness is available ($k > S$), in which
case the leading term of Eqn. (2) is exactly zero and the bias depends on the unknown Hölder
constant. These two cases lead to the following results.

Corollary 1. Let Assumptions 2.1, 2.2, and Cramér’s condition hold and $nh/\log(nh) \to \infty$.

(a) If $k \le S$ and $\sqrt{nh}\, h^k \to 0$,

$$\mathbb{P}[f \in I_{us}] = 1 - \alpha + \left\{ \frac{1}{nh} q_1(K) + nh^{1+2k} (f^{(k)})^2 \mu_{K,k}^2 q_2(K) + h^k f^{(k)} \mu_{K,k} q_3(K) \right\} \frac{\phi(z_{\alpha/2})}{f} \{1 + o(1)\}.$$

(b) If $k > S$ and $\sqrt{nh}\, h^{S+\varsigma} \to 0$,

$$\mathbb{P}[f \in I_{us}] = 1 - \alpha + \frac{1}{nh} \frac{\phi(z_{\alpha/2})}{f} q_1(K) \{1 + o(1)\} + O\left( nh^{1+2(S+\varsigma)} + h^{S+\varsigma} \right).$$

The first result is most directly comparable to Hall (1992b, §3.4), and many other past 
papers, which typically take as a starting point that the MSE-optimal bandwidth can be 
characterized. This shows that $T_{us}$ must be undersmoothed, in the sense that the MSE-optimal
bandwidth is “too large” to be valid: $h^*_{mse} \propto n^{-1/(1+2k)}$ is not allowed. In fact, $I_{us}(h^*_{mse})$
asymptotically undercovers because $T_{us}(h^*_{mse}) \to_d N(1, 1)$; for example if $\alpha = 0.05$,
$\mathbb{P}[f \in I_{us}(h^*_{mse})] \approx 0.83$. Instead, the optimal $h$ for coverage error, which can be characterized and
estimated, is equivalent in rates to balancing variance against bias, not squared bias as in
MSE. Part (b) shows that a faster rate of coverage error decay can be obtained by taking a
sufficiently high order kernel, relative to the level of smoothness S, at the expense of feasible 
bandwidth selection. 
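
The 0.83 figure quoted above can be checked directly from the stated $N(1, 1)$ limit: it is the probability that a unit-variance Normal variable with mean 1 falls inside the usual two-sided 95% critical region. The short sketch below performs this arithmetic; the unit mean shift is taken from the text, and the helper names are hypothetical.

```python
import math


def norm_cdf(t):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def coverage_under_shift(shift, z=1.959964):
    """P(|Z + shift| <= z) for Z ~ N(0, 1): coverage of a nominal 95% interval
    when the Studentized statistic is centred at `shift` instead of 0."""
    return norm_cdf(z - shift) - norm_cdf(-z - shift)


print(round(coverage_under_shift(1.0), 3))  # shift of 1, as in the N(1, 1) limit
print(round(coverage_under_shift(0.0), 3))  # no bias: nominal coverage
```

Setting the shift to zero recovers nominal coverage, which isolates the non-vanishing bias as the sole source of the undercoverage.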

Turning to robust bias correction, characterization of $\eta_{bc}$ is more complex as it has two
pieces: the second-order bias of the original point estimator, and the bias of the bias estimator
itself. The former is the $o(h^k)$ term of Eqn. (2) and is not the target of explicit bias correction;
it depends either on higher derivatives, if they are available, or on the Hölder condition
otherwise. To be precise, if $k \le S - 2$, this term is $h^{k+2} f^{(k+2)} \mu_{K,k+2} \{1 + o(1)\}$, while otherwise it
is known only to be $O(h^{S+\varsigma})$. Importantly, the bandwidth $b$ and order $\ell$ do not matter here,
and bias reduction beyond $O(\min\{h^{k+2}, h^{S+\varsigma}\})$ is not possible; there is thus little or no loss
in fixing $\ell = 2$, which we assume from now on to simplify notation.

The bias of the bias estimator also depends on the smoothness available: if enough
smoothness is available the corresponding bias term can be characterized, otherwise only its
order will be known. To be specific, when smoothness is not binding ($k \le S - 2$), arguably
the most practically-relevant case, the leading term of $\mathbb{E}[\hat{B}_f] - B_f$ will be $h^k b^2 f^{(k+2)} \mu_{K,k} \mu_{L,2}$.
Smoothness can be exhausted in two ways, either by the point estimate itself ($k > S$) or by
the bias estimation ($S - 1 \le k \le S$), and these two cases yield $O(h^k b^{S-k})$ and $O(h^k b^{S+\varsigma-k})$,
respectively, which are slightly different in how they depend on the total Hölder smoothness
assumed. (Complete details are in the supplement.) Note that regardless of the value of $k$,
we set $\hat{B}_f = h^k \hat{f}^{(k)} \mu_{K,k}$, even if $k > S$ and $B_f \equiv 0$.

With these calculations for $\eta_{bc}$, we have the following result.

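Corollary 2. Let Assumptions 2.1, 2.2, and Cramér’s condition hold, $nh/\log(nh) \to \infty$,
$\rho \to \bar{\rho} < \infty$, and $\ell = 2$.

(a) If $k \le S - 2$ and $\sqrt{nh}\, h^k b^2 \to 0$,

$$\mathbb{P}[f \in I_{rbc}] = 1 - \alpha + \Big\{ \frac{1}{nh} q_1(M_{\bar{\rho}}) + nh^{1+2(k+2)} (f^{(k+2)})^2 (\mu_{K,k+2} + \bar{\rho}^{-2} \mu_{K,k} \mu_{L,2})^2 q_2(M_{\bar{\rho}})$$

$$\qquad\qquad + h^{k+2} f^{(k+2)} (\mu_{K,k+2} + \bar{\rho}^{-2} \mu_{K,k} \mu_{L,2}) q_3(M_{\bar{\rho}}) \Big\} \frac{\phi(z_{\alpha/2})}{f} \{1 + o(1)\}.$$

(b) If $S - 1 \le k \le S$ and $\sqrt{nh}\, \rho^k b^{S+\varsigma} \to 0$,

$$\mathbb{P}[f \in I_{rbc}] = 1 - \alpha + \frac{1}{nh} \frac{\phi(z_{\alpha/2})}{f} q_1(M_{\bar{\rho}}) \{1 + o(1)\} + O\left( nh \rho^{2k} b^{2(S+\varsigma)} + \rho^k b^{S+\varsigma} \right).$$

(c) If $k > S$ and $\sqrt{nh}\, (h^{S+\varsigma} \vee \rho^k b^S) \to 0$,

$$\mathbb{P}[f \in I_{rbc}] = 1 - \alpha + \frac{1}{nh} \frac{\phi(z_{\alpha/2})}{f} q_1(M_{\bar{\rho}}) \{1 + o(1)\} + O\left( nh (h^{S+\varsigma} \vee \rho^k b^S)^2 + (h^{S+\varsigma} \vee \rho^k b^S) \right).$$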

Part (a) is the most empirically-relevant setting, which reflects the idea that researchers
first select a kernel order, then conduct inference based on that choice, taking the unknown
smoothness to be nonbinding. The most notable feature of this result, beyond the formalization
of the coverage improvement, is that the coverage error terms share the same structure
as those of Corollary 1, with $k$ replaced by $k + 2$, and represent the same conceptual objects.
By virtue of our new Studentization, the leading variance remains order $(nh)^{-1}$ and the
problematic correlation terms are absent. We explicitly discuss the advantages of robust bias
correction relative to undersmoothing in the following section.

Part (a) also argues for a bounded, positive $\bar{\rho}$. First, because bias reduction beyond
$O(h^{k+2})$ is not possible, $\rho \to \infty$ will only inflate the variance. On the other hand, $\bar{\rho} = 0$
requires a delicate choice of $b$ and $\ell > 2$, else the second bias term dominates $\eta_{bc}$, and the full
power of the variance correction is not exploited; that is, more bias may be removed without
inflating the variance rate. Hall (1992b, p. 682) remarked that if $\mathbb{E}[\hat{f}] - f - B_f$ is (part of)
the leading bias term, then “explicit bias correction [... ] is even less attractive relative to
undersmoothing.” We show, on the contrary, that with our proposed Studentization, it is
optimal that $\mathbb{E}[\hat{f}] - f - B_f$ is part of the dominant bias term.

Finally, in both Corollaries above the best possible coverage error decay rate (for a given
$S$) is attained by exhausting all available smoothness. This would also yield point estimators
attaining the bound of Stone (1982); robust bias correction can not evade such bounds, of
course. In both Corollaries, coverage is improved relative to part (a), but the constants and
optimal bandwidths can not be quantified. For robust bias correction, Corollary 2 shows that
to obtain the best rate in part (b) the unknown $f^{(k)}$ must be consistently estimated and $\bar{\rho}$
must be bounded and positive, while in part (c), bias estimation merely adds noise, but this
noise is fully accounted for by our new Studentization, as long as $\rho \to 0$ ($b \not\to 0$ is allowed).

2.3 Comparing Undersmoothing and Robust Bias Correction 

We now employ Corollaries 1 and 2 to directly compare nonparametric inference based on
undersmoothing or robust bias correction. To simplify the discussion we focus on three concrete
cases, which illustrate how the comparisons depend on the available smoothness and kernel
order; the messages generalize to any $S$ and/or $k$. For this discussion we let $k_{us}$ and $k_{bc}$ be
the kernel orders used for point estimation in $I_{us}$ and $I_{rbc}$, respectively, and restrict attention
to sequences $h \to 0$ where both confidence intervals are first-order valid, even though robust
bias correction allows for a broader bandwidth range. Finally, we set $\ell = 2$ and $\bar{\rho} \in (0, \infty)$
based on the above discussion.

For the first case, assume that $f$ is twice continuously differentiable ($S = 2$) and both
methods use second order kernels ($k_{us} = k_{bc} = \ell = 2$). In this case, both methods target the
same bias. The coverage errors for $I_{us}$ and $I_{rbc}$ then follow directly from Corollaries 1(a) and
2(b) upon plugging in these kernel orders, yielding

$$\left| \mathbb{P}[f \in I_{us}] - (1 - \alpha) \right| \asymp \frac{1}{nh} + nh^5 + h^2 \qquad \text{and} \qquad \left| \mathbb{P}[f \in I_{rbc}] - (1 - \alpha) \right| \asymp \frac{1}{nh} + nh^{5+2\varsigma} + h^{2+\varsigma}.$$

Because $h \to 0$ and $\bar{\rho} \in (0, \infty)$, the coverage error of $I_{rbc}$ vanishes more rapidly by virtue of
the bias correction. A higher order kernel ($k_{us} > 2$) would yield this rate for $I_{us}$.

Second, suppose that the density is four-times continuously differentiable ($S = 4$) but
second order kernels are maintained. The relevant results are now Corollaries 1(a) and 2(a).
Both methods continue to target the same leading bias, but now the additional smoothness
available allows precise characterization of the improvement shown above, and we have

$$\big|P[f \in I_{us}] - (1-\alpha)\big| \asymp \frac{1}{nh} + nh^{5} + h^{2} \quad\text{and}\quad \big|P[f \in I_{rbc}] - (1-\alpha)\big| \asymp \frac{1}{nh} + nh^{9} + h^{4}.$$

This case is perhaps the most empirically relevant one, where researchers first choose the order
of the kernel (here, second order) and then conduct/optimize inference based on that choice.
Indeed, for this case optimal bandwidth choices can be derived (Section 2.4).
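
The rate comparisons in these cases can be checked mechanically. The following is a minimal
sketch and not part of the original analysis: writing $h = n^{-\gamma}$, a term $n^{a}h^{b}$
becomes $n^{a - b\gamma}$, and the helper `best_rate` (a hypothetical name introduced here) looks
for the $\gamma$ at which the slowest-decaying term vanishes fastest, by checking the points where
pairs of terms balance.

```python
from fractions import Fraction
from itertools import combinations


def best_rate(terms):
    """Return (gamma, exponent) minimizing the largest exponent a - b*gamma.

    `terms` is a list of (a, b) pairs, each standing for a term n^a * h^b
    with h = n^(-gamma). The maximum of linear functions of gamma is convex
    and piecewise linear, so its minimum sits where two terms balance.
    Assumes at least two terms with different powers of h.
    """
    lines = [(Fraction(a), Fraction(b)) for a, b in terms]

    def worst(gamma):
        # Exponent of the slowest-decaying term at this gamma.
        return max(a - b * gamma for a, b in lines)

    candidates = set()
    for (a1, b1), (a2, b2) in combinations(lines, 2):
        if b1 != b2:
            gamma = (a1 - a2) / (b1 - b2)
            if gamma > 0:
                candidates.add(gamma)
    gamma_star = min(candidates, key=worst)
    return gamma_star, worst(gamma_star)


# Undersmoothing, second-order kernel: 1/(nh) + n h^5 + h^2
print(best_rate([(-1, -1), (1, 5), (0, 2)]))
# Robust bias correction with S = 4: 1/(nh) + n h^9 + h^4
print(best_rate([(-1, -1), (1, 9), (0, 4)]))
```

Under these assumptions the first call gives $h \propto n^{-1/3}$ with coverage error of order
$n^{-2/3}$, and the second gives $h \propto n^{-1/5}$ with coverage error of order $n^{-4/5}$.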

Finally, maintain $S = 4$ but suppose that undersmoothing is based on a fourth-order kernel
while bias correction continues to use two second-order kernels ($k_{us} = 4$, $k_{bc} = \ell = 2$).
This is the exact example given by Hall (1992b, p. 676). Now the two methods target different
biases, but utilize the same amount of smoothness. In this case, the relevant results are again
Corollaries 1(a) and 2(a), now with $k = 4$ and $k = 2$, respectively. The two methods have the
same coverage error decay rate:

$$\big|P[f \in I_{us}] - (1-\alpha)\big| \asymp \big|P[f \in I_{rbc}] - (1-\alpha)\big| \asymp \frac{1}{nh} + nh^{9} + h^{4}.$$

Indeed, more can be said: with the notation of Eqn. (4), the difference between $T_{us}$ and
$T_{rbc}$ is the change in “kernel” from $K$, a fixed function, to $M$, an $n$-varying,
higher-order kernel, and since $k_{bc} + \ell = k_{us}$, the two kernels are the same order. [$M$
acts as a higher-order kernel for bias, but may not strictly fit the definition, as explored in
the supplement.] This tight link between undersmoothing and robust bias correction does not carry
over straightforwardly to local polynomial regression, as discussed in more detail in Section 3.

In the context of this final example, it is worth revisiting traditional bias correction.
The fact that undersmoothing targets a different, and asymptotically smaller, bias than does
explicit bias correction, coupled with the requirement that $\rho \to 0$, implicitly constrains
bias correction to remove less bias than undersmoothing. This is necessary for traditional bias
correction, but on the contrary, robust bias correction attains the same coverage error decay
rate as undersmoothing under the same assumptions.



In sum, these examples show that under identical assumptions, bias correction is not 
inferior to undersmoothing and if any additional smoothness is available, can yield improved 
coverage error. These results are confirmed in our simulations. 

2.4 Optimal Bandwidth and Data-Driven Choice 

The prior sections established that robust bias correction can equal, or outperform,
undersmoothing for inference. We now show how the method can be implemented to deliver these
results in applications. We mimic typical empirical practice where researchers first choose
the order of the kernel, then conduct/optimize inference based on that choice. Therefore, we
assume the smoothness is unknown but taken to be large and work within Corollary 2(a),
that is, viewing $k \leq S - 2$ and $\ell = 2$ as fixed and $\rho$ bounded and positive. This setup
allows characterization of the coverage error optimal bandwidth for robust bias correction.

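Corollary 3. Under the conditions of Corollary 2(a) with $\bar\rho \in (0, \infty)$, if
$h = h^*_{rbc} = H^*_{rbc}(\bar\rho)\, n^{-1/(1+(k+2))}$, then
$P[f \in I_{rbc}] = 1 - \alpha + O(n^{-(k+2)/(1+(k+2))})$, where

$$H^*_{rbc}(\bar\rho) = \arg\min_{H > 0} \Big| H^{-1} q_1(M_{\bar\rho}) + H^{1+2(k+2)} \big(f^{(k+2)}\big)^2 \big(\mu_{K,k+2} + \bar\rho^{-2}\mu_{K,k}\mu_{L,2}\big)^2 q_2(M_{\bar\rho}) + H^{k+2} f^{(k+2)} \big(\mu_{K,k+2} + \bar\rho^{-2}\mu_{K,k}\mu_{L,2}\big) q_3(M_{\bar\rho}) \Big|.$$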
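
To make the minimization in Corollary 3 concrete, here is a minimal sketch that evaluates the
objective on a grid. It is an illustration only: the inputs `q1`, `q2`, `q3` (values of the
polynomials at the relevant quantile) and `bias_const` (the product of $f^{(k+2)}$ and the kernel
constant in parentheses) are hypothetical numbers that would have to be computed or estimated in
practice, and plain grid search is a stand-in chosen for transparency, not the authors'
implementation.

```python
def coverage_objective(H, q1, q2, q3, bias_const, k):
    """|H^{-1} q1 + H^{1+2(k+2)} B^2 q2 + H^{k+2} B q3| with B = bias_const."""
    return abs(
        q1 / H
        + H ** (1 + 2 * (k + 2)) * bias_const ** 2 * q2
        + H ** (k + 2) * bias_const * q3
    )


def argmin_H(q1, q2, q3, bias_const, k, lo=0.01, hi=5.0, steps=5000):
    """Grid search for the constant H > 0 minimizing the objective above."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda H: coverage_objective(H, q1, q2, q3, bias_const, k))


# Hypothetical inputs, second-order kernel (k = 2).
H_star = argmin_H(q1=1.0, q2=1.0, q3=0.0, bias_const=1.0, k=2)
n = 500
h_star = H_star * n ** (-1.0 / (1 + (2 + 2)))  # h = H * n^{-1/(1+(k+2))}
print(H_star, h_star)
```

With `q3 = 0` the objective reduces to $H^{-1} + H^{9}$, whose minimizer $9^{-1/10}$ gives a
simple check on the grid search.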


We can use this result to give concrete methodological recommendations. At the end of
this section we discuss the important issue of interval length. Construction of the interval
$I_{rbc}$ from Eqn. (5) requires choices of bandwidths $h$ and $b$ and kernels $K$ and $L$. Given
these choices, the point estimate, bias correction, and variance estimators are then readily
computable from data using the formulas above. For the kernels $K$ and $L$, we recommend
either second order minimum variance (to minimize interval length) or MSE-optimal kernels
(see, e.g., Gasser et al., 1985, and the supplemental appendix).

The bandwidth selections are more important in applications. For the bandwidth $h$,
Corollary 2(a) shows that the MSE-optimal choice $h^*_{mse}$ will deliver valid inference, but
will be suboptimal in general (Corollary 3). From a practical point of view, the robust bias
corrected interval $I_{rbc}(h)$ is attractive because it allows for the MSE-optimal bandwidth and
kernel, and hence is based on the MSE-optimal point estimate, while using the same effective
sample for both point estimation and inference. Interestingly, although $I_{rbc}(h^*_{mse})$ is
always valid, its coverage error decays as $n^{-\min\{4,\,k+2\}/(1+2k)}$ and is thus rate optimal
only for second order kernels ($k = 2$), while otherwise being suboptimal with a slower coverage
error rate the larger is the kernel order $k$.
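
As a quick arithmetic check of this claim, the sketch below tabulates the coverage error exponent
obtained with the MSE-optimal bandwidth against the best exponent from Corollary 3. The function
names are introduced here for illustration only.

```python
from fractions import Fraction


def rate_at_mse_bandwidth(k):
    """Exponent r in n^{-r} when h is proportional to n^{-1/(1+2k)}."""
    return Fraction(min(4, k + 2), 1 + 2 * k)


def rate_at_optimal_bandwidth(k):
    """Exponent r in n^{-r} when h is proportional to n^{-1/(1+(k+2))}."""
    return Fraction(k + 2, 1 + (k + 2))


for k in (2, 4, 6):
    print(k, rate_at_mse_bandwidth(k), rate_at_optimal_bandwidth(k))
```

For $k = 2$ both exponents equal $4/5$; for $k = 4$ they are $4/9$ and $6/7$, so the gap widens as
the kernel order grows.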

Corollary 3 gives the coverage error optimal bandwidth, $h^*_{rbc}$, which can be implemented
using a simple direct plug-in (DPI) rule: $\hat h_{dpi} = \hat H_{dpi}\, n^{-1/(k+3)}$, where
$\hat H_{dpi}$ is a plug-in estimate of $H^*_{rbc}$ in Corollary 3 formed by replacing the unknown
$f^{(k+2)}$ with a pilot estimate (e.g., a consistent nonparametric estimator based on the
appropriate MSE-optimal bandwidth). In the supplement we give precise implementation details, as
well as an alternative rule-of-thumb bandwidth selector based on rescaling already available
data-driven MSE-optimal choices.

For the bandwidth $b$, a simple choice is $b = h$, or, equivalently, $\rho = 1$. We show in the
supplement that setting $\rho = 1$ has good theoretical properties, minimizing interval length of
$I_{rbc}$ or the MSE of the point estimator, depending on the conditions imposed. In our numerical
work, we found that $\rho = 1$ performed well. As a result, from the practitioner’s point of view,
the choice of $b$ (or $\rho$) is completely automatic, leaving only one bandwidth to select.

An extensive simulation study, reported in the supplement, illustrates our findings and
explores the numerical performance of these choices. We find that coverage of $I_{rbc}$ is robust
to both $h$ and $\rho$ and that our data-driven bandwidth selectors work well in practice, but we
note that estimating bandwidths may have higher-order implications (e.g. Hall and Kang, 2001).

Finally, an important issue in applications is whether the good coverage properties of $I_{rbc}$
come at the expense of increased interval length. When coverage is asymptotically correct,
Corollaries 1 and 2 show that $I_{rbc}$ can accommodate (and will optimally employ) a larger
bandwidth (i.e. $h \to 0$ more slowly), and hence $I_{rbc}$ will have shorter average length in
large samples than $I_{us}$. Our simulation study (see below and the supplemental appendix) gives
the same conclusion.


2.5 Other Methods of Bias Correction 


We study a plug-in bias correction method, but there are alternatives. In particular, as pointed
out by a reviewer, a leading alternative is the generalized jackknife method of Schucany and
Sommers (1977). We will briefly summarize this approach and show a tight connection to our
results (restricting to second-order kernels and $S \geq 2$ only for simplicity).

The generalized jackknife estimator is $\hat f_{GJ,R} := (\hat f_1 - R \hat f_2)/(1 - R)$, where
$\hat f_1$ and $\hat f_2$ are two initial kernel density estimators, with possibly different
bandwidths $(h_1, h_2)$ and second-order kernels $(K_1, K_2)$. From Eqn. (2), the bias of
$\hat f_{GJ,R}$ is $(1 - R)^{-1} f^{(2)} (h_1^2 \mu_{K_1,2} - R h_2^2 \mu_{K_2,2}) + o(h_1^2 + h_2^2)$,
whence choosing $R = (h_1^2 \mu_{K_1,2})/(h_2^2 \mu_{K_2,2})$ renders the leading bias term exactly
zero. Further, if $S \geq 4$, $\hat f_{GJ,R}$ has bias $O(h_1^4 + h_2^4)$; behaving as a point
estimator with $k = 4$. To connect this approach to ours, observe that with this choice of $R$ and
$\rho = h_1/h_2$, then

$$\hat f_{GJ,R} = \frac{1}{n h_1} \sum_{i=1}^{n} M\!\left(\frac{x - X_i}{h_1}\right), \qquad M(u) = K_1(u) - \rho^{1+2} \left\{ \frac{K_2(\rho u) - \rho^{-1} K_1(u)}{\mu_{K_2,2}(1 - R)} \right\} \mu_{K_1,2},$$

exactly matching Eqn. (4); alternatively, write
$\hat f_{GJ,R} = \hat f_1 - h_1^2 \tilde f^{(2)} \mu_{K_1,2}$, where

$$\tilde f^{(2)} = \frac{1}{n h_2^{1+2}} \sum_{i=1}^{n} \tilde L\!\left(\frac{x - X_i}{h_2}\right), \qquad \tilde L(u) = \frac{K_2(u) - \rho^{-1} K_1(\rho^{-1} u)}{\mu_{K_2,2}(1 - R)},$$

is a derivative estimator. Therefore, we can view $\hat f_{GJ,R}$ as a specific kernel $M$ or a
specific derivative estimator, and all our results directly apply to $\hat f_{GJ,R}$; hence our
paper offers a new way of conducting inference (new Studentization) for this case as well. Though
we omit the details to conserve space, this is equally true for local polynomial regression
(Section 3).
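
The algebraic identity between the generalized jackknife and the single-kernel form can be
verified numerically. The sketch below is illustrative only: the choice of an Epanechnikov kernel
for $K_1$ and a biweight kernel for $K_2$, the bandwidths, and all function names are assumptions
made here, not choices taken from the paper.

```python
import random


def k1(u):
    """Epanechnikov kernel (assumed choice for K_1)."""
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0


def k2(u):
    """Biweight kernel (assumed choice for K_2)."""
    return (15 / 16) * (1 - u * u) ** 2 if abs(u) <= 1 else 0.0


MU1 = 1 / 10  # (1/2) * integral of u^2 K_1(u) du
MU2 = 1 / 14  # (1/2) * integral of u^2 K_2(u) du


def kde(data, x, h, kernel):
    return sum(kernel((x - xi) / h) for xi in data) / (len(data) * h)


def gj_direct(data, x, h1, h2):
    """Generalized jackknife: (f1 - R f2) / (1 - R) with the bias-cancelling R."""
    r = (h1 ** 2 * MU1) / (h2 ** 2 * MU2)
    return (kde(data, x, h1, k1) - r * kde(data, x, h2, k2)) / (1 - r)


def gj_as_kernel(data, x, h1, h2):
    """Same estimator written as a kernel estimator with the induced kernel M."""
    rho = h1 / h2
    r = rho ** 2 * MU1 / MU2

    def m(u):
        return k1(u) - rho ** 3 * (k2(rho * u) - k1(u) / rho) / (MU2 * (1 - r)) * MU1

    return sum(m((x - xi) / h1) for xi in data) / (len(data) * h1)


random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(200)]
print(gj_direct(sample, 0.3, 0.4, 0.8), gj_as_kernel(sample, 0.3, 0.4, 0.8))
```

The two printed values agree up to floating-point rounding, whatever the data, as long as
$R \neq 1$.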

More generally, our main ideas and generic results apply to many other bias correction
methods. For a second example, Singh (1977) also proposed a plug-in bias estimator, but
without using the derivative of a kernel. Our results cover this approach as well. See the
supplement for further details and references. The key, common message in all cases is that
to improve inference the procedure must account for the additional variability introduced by
a bias correction method (i.e., to avoid the mismatch present in $T_{bc}$).

3 Local Polynomial Estimation and Inference 

This section studies local polynomial regression (Ruppert and Wand, 1994; Fan and Gijbels, 
1996), and has two principal aims. First, we show that the conclusions from the density case, 
and their implications for practice, carry over to odd-degree local polynomials. Second, we 
show that with proper fixed-n Studentization, coverage error adapts to boundary points. We 
focus on what is novel relative to the density, chiefly variance estimation and boundary points. 
For interior points, the implications for coverage error, bandwidth selection, and interval length 
are all analogous to the density case, and we will not retread those conclusions. 

To be specific, throughout this section we focus on the case where the smoothness is large
relative to the local polynomial degree $p$, which is arguably the most relevant case in practice.
The results and discussion in Sections 2.2 and 2.3 carry over, essentially upon changing $k$ to
$p + 1$ and $\ell$ to $q - p$ (or $q - p + 1$ for interior points with $q$ even). Similarly, but
with increased notational burden, the conclusions of Section 2.5 also remain true for this
section. The present results also extend to multivariate data and derivative estimation.

To begin, we define the regression estimator, its bias, and the bias correction. Given a
random sample $\{(Y_i, X_i) : 1 \leq i \leq n\}$, the local polynomial estimator of
$m(x) = E[Y_i \mid X_i = x]$, temporarily making explicit the evaluation point, is

$$\hat m(x) = e_0' \hat\beta_p, \qquad \hat\beta_p = \arg\min_{b \in \mathbb{R}^{p+1}} \sum_{i=1}^{n} \big(Y_i - r_p(X_i - x)' b\big)^2 K\!\left(\frac{X_i - x}{h}\right),$$

where, for an integer $p \geq 1$, $e_0$ is the $(p+1)$-vector with a one in the first position and
zeros in the rest, and $r_p(u) = (1, u, u^2, \ldots, u^p)'$. We restrict attention to $p$ odd, as
is standard, though the qualifier may be omitted. We define $Y = (Y_1, \ldots, Y_n)'$,
$R_p = [r_p((X_1 - x)/h), \ldots, r_p((X_n - x)/h)]'$,
$W_p = \mathrm{diag}(h^{-1} K((X_i - x)/h) : i = 1, \ldots, n)$, and $\Gamma_p = R_p' W_p R_p / n$
(here $\mathrm{diag}(a_i : i = 1, \ldots, n)$ denotes the $n \times n$ diagonal matrix constructed
using $a_1, a_2, \ldots, a_n$). Then, reverting back to omitting the argument $x$, the local
polynomial estimator is $\hat m = e_0' \Gamma_p^{-1} R_p' W_p Y / n$.
Under regularity conditions below, the conditional bias satisfies

$$E[\hat m \mid X_1, \ldots, X_n] - m = h^{p+1} m^{(p+1)} \frac{1}{(p+1)!} e_0' \Gamma_p^{-1} \Lambda_p + o_P(h^{p+1}), \qquad (6)$$

where $\Lambda_p = R_p' W_p [((X_1 - x)/h)^{p+1}, \ldots, ((X_n - x)/h)^{p+1}]'/n$. Here, the
quantity $e_0' \Gamma_p^{-1} \Lambda_p/(p+1)!$ is random, unlike in the density case (c.f. (2)),
but it is known and bounded in probability. Following Fan and Gijbels (1996, Section 4.4, p. 116),
we will estimate $m^{(p+1)}$ in (6) using a second local polynomial regression, of degree $q > p$
(even or odd), based on a kernel $L$ and bandwidth $b$. Thus, $r_q(u)$, $R_q$, $W_q$, and
$\Gamma_q$ are defined as above, but substituting $q$, $L$, and $b$ in place of $p$, $K$, and $h$,
respectively. Denote by $e_{p+1}$ the $(q+1)$-vector with one in the $p + 2$ position, and zeros
in the rest. Then we estimate the bias with

$$\hat B_m = h^{p+1} \hat m^{(p+1)} \frac{1}{(p+1)!} e_0' \Gamma_p^{-1} \Lambda_p, \qquad \hat m^{(p+1)} = b^{-p-1} (p+1)!\, e_{p+1}' \Gamma_q^{-1} R_q' W_q Y / n.$$

Exactly as in the density case, $\hat B_m$ introduces variance that is controlled by $\rho$ and
will be captured by robust bias correction.

3.1 Variance Estimation 

The Studentizations in the density case were based on fixed-n expectations, and we will show 
that retaining this is crucial for local polynomials. The fixed-n versus asymptotic distinction 
is separate from, and more fundamental than, whether we employ feasible versus infeasible 
quantities. The advantage of fixed-n Studentization also goes beyond bias correction. 

To begin, we condition on the covariates so that $\Gamma_p^{-1}$ is fixed. Define
$v(\cdot) = V[Y \mid X = \cdot]$ and $\Sigma = \mathrm{diag}(v(X_i) : i = 1, \ldots, n)$.
Straightforward calculation gives

$$\sigma^2_{us} = (nh) V[\hat m \mid X_1, \ldots, X_n] = e_0' \Gamma_p^{-1} \big( h R_p' W_p \Sigma W_p R_p / n \big) \Gamma_p^{-1} e_0. \qquad (7)$$

One can then show that $\sigma^2_{us} \to_P v(x) f(x)^{-1} \mathcal{V}(K, p)$, with
$\mathcal{V}(K, p)$ a known, constant function of the kernel and polynomial degree. Importantly,
both the nonasymptotic form and the convergence hold in the interior or on the boundary, though
$\mathcal{V}(K, p)$ changes.

To first order, one could use $\sigma^2_{us}$ or the leading asymptotic term; all that remains is to
make each feasible, requiring estimators of the variance function, and for the asymptotic form, 
also the density. These may be difficult to estimate when x is a boundary point. Concerned 
by this, Chen and Qin (2002, p. 93) consider feasible and infeasible versions but conclude 
that “an increased coverage error near the boundary is still the case even when we know 
the values of f(x) and v(x).” Our results show that this is not true in general: using fixed-n 
Studentization, feasible or infeasible, leads to confidence intervals with the same coverage error 
rates at interior and boundary points, thereby retaining the celebrated boundary carpentry 
property. 

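For robust bias correction, $\sigma^2_{rbc} = (nh) V[\hat m - \hat B_m \mid X_1, \ldots, X_n]$
captures the variance of $\hat m$ and $\hat m^{(p+1)}$ as well as their covariance. A similar
fixed-$n$ calculation gives

$$\sigma^2_{rbc} = e_0' \Gamma_p^{-1} \big( h\, \Xi_{p,q} \Sigma\, \Xi_{p,q}' / n \big) \Gamma_p^{-1} e_0, \qquad \Xi_{p,q} = R_p' W_p - \rho^{p+1} \Lambda_p e_{p+1}' \Gamma_q^{-1} R_q' W_q. \qquad (8)$$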


To make the fixed-$n$ scalings feasible, $\hat\sigma^2_{us}$ and $\hat\sigma^2_{rbc}$ take the
forms (7) and (8) and replace $\Sigma$ with an appropriate estimator. First, we form
$\hat v(X_i) = (Y_i - r_p(X_i - x)' \hat\beta_p)^2$ for $\hat\sigma^2_{us}$ or
$\hat v(X_i) = (Y_i - r_q(X_i - x)' \hat\beta_q)^2$ for $\hat\sigma^2_{rbc}$. The latter is
bias-reduced because $r_p(X_i - x)' \beta_p$ is a $p$-term Taylor expansion of $m(X_i)$ around
$x$, and $\hat\beta_p$ estimates $\beta_p$ (similarly with $q$ in place of $p$), and we have
$q > p$. Next, motivated by the fact that least-squares residuals are on average too small, we
appeal to the HC$k$ class of estimators (see MacKinnon (2013) for a review), which are defined as
follows. First, $\hat\sigma^2_{us}$-HC0 uses
$\hat\Sigma_{us} = \mathrm{diag}(\hat v(X_i) : i = 1, \ldots, n)$. Then, $\hat\sigma^2_{us}$-HC$k$,
$k = 1, 2, 3$, is obtained by dividing $\hat v(X_i)$ by, respectively,
$(n - 2\,\mathrm{tr}(Q_p) + \mathrm{tr}(Q_p' Q_p))/n$, $(1 - Q_{p,ii})$, or $(1 - Q_{p,ii})^2$,
where $Q_p := R_p \Gamma_p^{-1} R_p' W_p / n$ is the projection matrix and $Q_{p,ii}$ its $i$-th
diagonal element. The corresponding estimators $\hat\sigma^2_{rbc}$-HC$k$ are the same, but with
$q$ in place of $p$. For theoretical results, we use HC0 for concreteness and simplicity, though
inspection of the proof shows that simple modifications allow for the other HC$k$ estimators and
rates do not change. These estimators may perform better for small sample sizes. Another option is
to use a nearest-neighbor-based variance estimator with a fixed number of neighbors, following
the ideas of Müller and Stadtmüller (1987) and Abadie and Imbens (2008). Note that none of
these estimators assume local or global homoskedasticity nor rely on new tuning parameters.
Details and simulation results for all these estimators are given in the supplement; see §S.II.2.3
and Table S.II.9.
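
The HC$k$ adjustments described above amount to rescaling squared residuals. A minimal sketch
follows, assuming the squared residuals, the diagonal of the projection matrix, and the two traces
have already been computed; the function name and argument layout are hypothetical.

```python
def hc_adjusted(resid_sq, q_diag, tr_q, tr_qq, kind=0):
    """Rescale squared residuals according to the HC0-HC3 conventions.

    resid_sq : squared residuals, one per observation
    q_diag   : diagonal elements Q_ii of the projection matrix
    tr_q     : trace of Q
    tr_qq    : trace of Q'Q
    kind     : 0, 1, 2, or 3
    """
    n = len(resid_sq)
    if kind == 0:
        return list(resid_sq)
    if kind == 1:
        # Single degrees-of-freedom style correction shared by all observations.
        scale = (n - 2 * tr_q + tr_qq) / n
        return [v / scale for v in resid_sq]
    if kind == 2:
        return [v / (1 - q) for v, q in zip(resid_sq, q_diag)]
    if kind == 3:
        return [v / (1 - q) ** 2 for v, q in zip(resid_sq, q_diag)]
    raise ValueError("kind must be 0, 1, 2, or 3")


# Toy numbers, purely illustrative.
print(hc_adjusted([1.0, 4.0], [0.5, 0.5], tr_q=1.0, tr_qq=1.0, kind=3))
```

The adjusted values would then populate the diagonal of the estimator of $\Sigma$ in the variance
formulas.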

3.2 Higher Order Expansions of Coverage Error 

Recycling notation to emphasize the parallel, we study the following three statistics:

$$T_{us} = \frac{\sqrt{nh}(\hat m - m)}{\hat\sigma_{us}}, \qquad T_{bc} = \frac{\sqrt{nh}(\hat m - \hat B_m - m)}{\hat\sigma_{us}}, \qquad T_{rbc} = \frac{\sqrt{nh}(\hat m - \hat B_m - m)}{\hat\sigma_{rbc}},$$

and their associated confidence intervals $I_{us}$, $I_{bc}$, and $I_{rbc}$, exactly as in Eqn.
(5). Importantly, all present definitions and results are valid for an evaluation point in the
interior and at the boundary of the support of $X_i$. The following standard conditions will
suffice, augmented with the appropriate Cramér’s condition given in the supplement to conserve
space.

Assumption 3.1 (Data-generating process). $\{(Y_1, X_1), \ldots, (Y_n, X_n)\}$ is a random sample,
where $X_i$ has the absolutely continuous distribution with Lebesgue density $f$,
$E[Y^{8+\delta} \mid X] < \infty$ for some $\delta > 0$, and in a neighborhood of $x$, $f$ and $v$
are continuous and bounded away from zero, $m$ is $S \geq q + 2$ times continuously differentiable
with bounded derivatives, and $m^{(S)}$ is Hölder continuous with exponent $\varsigma$.

Assumption 3.2 (Kernels). The kernels $K$ and $L$ are positive, bounded, even functions with
compact support.

We now give our main, generic result for local polynomials, analogous to Theorem 1. For
notation, the polynomials $q_1$, $q_2$, and $q_3$ and the biases $\eta_{us}$ and $\eta_{bc}$ are
cumbersome and exact forms are deferred to the supplement. All that matters is that the
polynomials are known, odd, bounded, and bounded away from zero and that the biases have the usual
convergence rates, as detailed below.

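Theorem 2. Let Assumptions 3.1, 3.2, and Cramér’s condition hold and $nh/\log(nh) \to \infty$.

(a) If $\eta_{us} \log(nh) \to 0$, then

$$P[m \in I_{us}] = 1 - \alpha + \Big\{ \frac{1}{nh} q_{1,us} + \eta_{us}^2 q_{2,us} + \frac{\eta_{us}}{\sqrt{nh}} q_{3,us} \Big\} \phi(z_{\alpha/2}) \{1 + o(1)\}.$$

(b) If $\eta_{bc} \log(nh) \to 0$ and $\rho \to 0$, then

$$P[m \in I_{bc}] = 1 - \alpha + \Big\{ \frac{1}{nh} q_{1,us} + \eta_{bc}^2 q_{2,us} + \frac{\eta_{bc}}{\sqrt{nh}} q_{3,us} \Big\} \phi(z_{\alpha/2}) \{1 + o(1)\} + \rho^{p+2} \big(\Omega_{1,bc} + \rho^{p+1} \Omega_{2,bc}\big) \phi(z_{\alpha/2}) z_{\alpha/2} \{1 + o(1)\}.$$

(c) If $\eta_{bc} \log(nh) \to 0$ and $\rho \to \bar\rho < \infty$, then

$$P[m \in I_{rbc}] = 1 - \alpha + \Big\{ \frac{1}{nh} q_{1,rbc} + \eta_{bc}^2 q_{2,rbc} + \frac{\eta_{bc}}{\sqrt{nh}} q_{3,rbc} \Big\} \phi(z_{\alpha/2}) \{1 + o(1)\}.$$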

This theorem, which covers both interior and boundary points, establishes that the conclusions
found in the density case carry over to odd-degree local polynomial regression. (Although
we focus on $p$ odd, part (a) is valid in general and (b) and (c) are valid at the boundary for
$p$ even.) In particular, this shows that robust bias correction is as good as, or better than,
undersmoothing in terms of coverage error. Traditional bias correction is again inferior due to
the variance and covariance terms $\rho^{p+2}(\Omega_{1,bc} + \rho^{p+1} \Omega_{2,bc})$. Coverage
error optimal bandwidths can be derived as well, and similar conclusions are found. Best possible
rates are defined for fixed $p$ here, the analogue of $k$ above; see Section 2.2 for further
discussion on smoothness.

Before discussing bias correction, one aspect of the undersmoothing result is worth mentioning.
The fact that Theorem 2 covers both interior and boundary points, without requiring
additional assumptions, is, in some sense, expected: one of the strengths of local polynomial
estimation is its adaptability to boundary points. In particular, from Eqn. (6) and $p$ odd it
follows that $\eta_{us} \asymp \sqrt{nh}\, h^{p+1}$ at the interior and the boundary. Therefore,
part (a) shows that the decay rate in coverage error does not change at the boundary for the
standard confidence interval (but the leading constants will change). This finding contrasts with
the result of Chen and Qin (2002) who studied the special case $p = 1$ without bias correction
(part (a) of Theorem 2), and is due entirely to the fixed-$n$ Studentization.

Turning to robust bias correction, we will, in contrast, find rate differences between the
interior and the boundary, no matter the parity of $q$. As before, $\eta_{bc}$ has two terms,
representing the higher-order bias of the point estimator and the bias of the bias estimator. The
former can be viewed as the bias if $m^{(p+1)}$ were zero, and since $p + 1$ is even, we find that
it is of order $\sqrt{nh}\, h^{p+3}$ in the interior but $\sqrt{nh}\, h^{p+2}$ at the boundary. The
bias of the bias correction depends on both bandwidths $h$ and $b$, as well as $p$ and $q$, in
exact analogy to the density case. For $q$ odd, it is of order $h^{p+1} b^{q-p}$ at all points,
whereas for $q$ even this rate is attained at the boundary, but in the interior the order
increases to $h^{p+1} b^{q+1-p}$. Collecting these facts: in the interior,
$\eta_{bc} \asymp \sqrt{nh}\, h^{p+3} (1 + \rho^{-2} b^{q-p-2})$ for odd $q$, or with $b^{q-p-1}$
for $q$ even; at the boundary, $\eta_{bc} \asymp \sqrt{nh}\, h^{p+2} (1 + \rho^{-1} b^{q-p-1})$.
Further details are in the supplement.

In light of these rates, the same logic of Section 2.2 leads us to restrict attention to bounded,
positive $\rho$ and $q = p + 1$, and thus even. Calonico et al. (2014, Remark 7) point out that in
the special case of $q = p + 1$, $K = L$, and $\rho = 1$, $\hat m - \hat B_m$ is identical to a
local polynomial estimator of order $q$: this is the closest analogue to $M$ being a higher-order
kernel. If the point of interest is in the interior, then $q = p + 2$ yields the same rates.
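
The special case noted above ($q = p + 1$, $K = L$, $\rho = 1$) suggests a compact way to sketch
both intervals: fit a local polynomial of degree $p$ for the conventional interval and of degree
$p + 1$, with the same bandwidth, for the robust bias-corrected one, each with its own HC0
standard error. The code below is a minimal illustration under those assumptions, using an
Epanechnikov kernel and a small hand-written linear solver; it is not the nprobust implementation,
and all function names are hypothetical.

```python
import math
from statistics import NormalDist


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def locpoly(xs, ys, x0, h, deg):
    """Local polynomial fit at x0; returns (estimate, HC0 standard error)."""
    n, d = len(xs), deg + 1
    rows, wts = [], []
    for xi in xs:
        u = (xi - x0) / h
        wts.append(0.75 * (1 - u * u) / h if abs(u) <= 1 else 0.0)
        rows.append([u ** j for j in range(d)])
    # Gamma = R'WR/n and R'WY/n, built entry by entry.
    gamma = [[sum(w * r[a] * r[c] for r, w in zip(rows, wts)) / n
              for c in range(d)] for a in range(d)]
    rhs = [sum(w * r[a] * y for r, w, y in zip(rows, wts, ys)) / n for a in range(d)]
    beta = solve(gamma, rhs)
    # coef = Gamma^{-1} e_0, so the estimate is sum_i (coef'r_i w_i / n) Y_i.
    coef = solve(gamma, [1.0] + [0.0] * deg)
    var = 0.0
    for r, w, y in zip(rows, wts, ys):
        resid = y - sum(bj * rj for bj, rj in zip(beta, r))
        weight = sum(cj * rj for cj, rj in zip(coef, r)) * w / n
        var += (weight * resid) ** 2  # HC0: squared residual as variance proxy
    return beta[0], math.sqrt(var)


def us_and_rbc_intervals(xs, ys, x0, h, p=1, alpha=0.05):
    """Conventional (degree p) and robust bias-corrected (degree p + 1) intervals."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    m_us, se_us = locpoly(xs, ys, x0, h, p)
    m_rbc, se_rbc = locpoly(xs, ys, x0, h, p + 1)
    return (m_us - z * se_us, m_us + z * se_us), (m_rbc - z * se_rbc, m_rbc + z * se_rbc)
```

For example, `us_and_rbc_intervals(xs, ys, 0.0, 0.3)` returns the two intervals at $x = 0$ for
local linear point estimation with bandwidth $0.3$. The choice $b = h$ is what makes the shortcut
valid; other values of $\rho$ require the general formulas (6) and (8).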

For notational ease, let $\tilde\eta^{int}_{bc}$ and $\tilde\eta^{bnd}_{bc}$ be the leading
constants for the interior and boundary, respectively, so that e.g.
$\eta_{bc} = \sqrt{nh}\, h^{p+3} [\tilde\eta^{int}_{bc} + o(1)]$ in the interior (exact expressions
are in the supplement). We then have the following, precise result; the analogue of Corollary 2(a).



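Corollary 4. Let the conditions of Theorem 2(c) hold, with $\bar\rho \in (0, \infty)$ and
$q = p + 1$.

(a) For an interior point,

$$P[m \in I_{rbc}] = 1 - \alpha + \Big\{ \frac{1}{nh} q_{1,rbc} + nh^{1+2(p+3)} (\tilde\eta^{int}_{bc})^2 q_{2,rbc} + h^{p+3} (\tilde\eta^{int}_{bc}) q_{3,rbc} \Big\} \phi(z_{\alpha/2}) \{1 + o(1)\}.$$

(b) For a boundary point,

$$P[m \in I_{rbc}] = 1 - \alpha + \Big\{ \frac{1}{nh} q_{1,rbc} + nh^{1+2(p+2)} (\tilde\eta^{bnd}_{bc})^2 q_{2,rbc} + h^{p+2} (\tilde\eta^{bnd}_{bc}) q_{3,rbc} \Big\} \phi(z_{\alpha/2}) \{1 + o(1)\}.$$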

There are differences in both the rates and constants between parts (a) and (b) of this
result, though most of the changes to constants are “hidden” notationally by the definitions
of $\tilde\eta^{bnd}_{bc}$ and the polynomials $q_{k,rbc}$. Part (a) most closely resembles
Corollary 2 due to the symmetry yielding the corresponding rate improvement (recall that $k$ in
the density case is replaced with $p + 1$ here), and hence all the corresponding conclusions hold
qualitatively for local polynomials.

3.3 Practical Choices and Empirical Consequences 

As we did for the density, we now derive bandwidth choices, and data-driven implementations, 
to optimize coverage error in applications. 

Corollary 5. Let the conditions of Corollary 4 hold.

(a) For an interior point, if $h = h^*_{rbc} = H^*_{rbc}(\bar\rho)\, n^{-1/(p+4)}$, then
$P[m \in I_{rbc}] = 1 - \alpha + O(n^{-(p+3)/(p+4)})$, where

$$H^*_{rbc}(\bar\rho) = \arg\min_{H > 0} \Big| H^{-1} q_{1,rbc} + H^{1+2(p+3)} (\tilde\eta^{int}_{bc})^2 q_{2,rbc} + H^{p+3} (\tilde\eta^{int}_{bc}) q_{3,rbc} \Big|.$$

(b) For a boundary point, if $h = h^*_{rbc} = H^*_{rbc}(\bar\rho)\, n^{-1/(p+3)}$, then
$P[m \in I_{rbc}] = 1 - \alpha + O(n^{-(p+2)/(p+3)})$, where

$$H^*_{rbc}(\bar\rho) = \arg\min_{H > 0} \Big| H^{-1} q_{1,rbc} + H^{1+2(p+2)} (\tilde\eta^{bnd}_{bc})^2 q_{2,rbc} + H^{p+2} (\tilde\eta^{bnd}_{bc}) q_{3,rbc} \Big|.$$

To implement these results, we first set $\rho = 1$ and the kernels $K$ and $L$ equal to any
desired second order kernel, typical choices being triangular, Epanechnikov, and uniform.
The variance estimator $\hat\sigma_{rbc}$ is defined in Section 3.1, and is fully implementable,
and thus so is $I_{rbc}$, once the bandwidth $h$ is chosen.

For selecting $h$ at an interior point, the same conclusions from density estimation apply:
(i) coverage of $I_{rbc}$ is quite robust with respect to $h$ and $\rho$, (ii) feasible choices
for $h$ are easy to construct, and (iii) an MSE-optimal bandwidth only delivers the best coverage
error for $p = 1$ (that is, $k = 2$ in the density case). On the other hand, for a boundary point,
an interesting consequence of Corollary 5 is that an MSE-optimal bandwidth never delivers optimal
coverage error decay rates, even for local linear regression:
$h^*_{mse} \propto n^{-1/(2p+3)} \gg h^*_{rbc} \propto n^{-1/(p+3)}$.

Keeping this in mind, we give a fully data-driven direct plug-in (DPI) bandwidth selector
for both interior and boundary points: $\hat h^{int}_{dpi} = \hat H^{int}_{dpi}\, n^{-1/(p+4)}$ and
$\hat h^{bnd}_{dpi} = \hat H^{bnd}_{dpi}\, n^{-1/(p+3)}$, where $\hat H^{int}_{dpi}$ and
$\hat H^{bnd}_{dpi}$ are estimates of (the appropriate) $H^*_{rbc}$ of Corollary 5, obtained by
estimating unknowns by pilot estimators employing a readily-available pilot bandwidth. The
complete steps to form $\hat H^{int}_{dpi}$ and $\hat H^{bnd}_{dpi}$ are in the supplement, as is a
second data-driven bandwidth choice, based on rescaling already-available MSE-optimal bandwidths.
All our methods are available in the R package nprobust; see
https://cran.r-project.org/package=nprobust.
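
The rate comparison between MSE-optimal and coverage-optimal bandwidths can be tabulated directly.
The sketch below also shows one way an MSE-optimal bandwidth could be rescaled to the
coverage-optimal rate. This is an assumption for illustration: it adjusts only the power of $n$
and ignores the constants, so the exact rule in the supplement may differ. Function names are
hypothetical.

```python
from fractions import Fraction


def rate_exponents(p, boundary=False):
    """Return (g_mse, g_rbc) with h_mse ~ n^{-g_mse} and h_rbc ~ n^{-g_rbc}."""
    g_mse = Fraction(1, 2 * p + 3)
    g_rbc = Fraction(1, p + 3) if boundary else Fraction(1, p + 4)
    return g_mse, g_rbc


def rescale_mse_bandwidth(h_mse, n, p, boundary=False):
    """Shrink an MSE-optimal bandwidth to the coverage-optimal rate (rate only)."""
    g_mse, g_rbc = rate_exponents(p, boundary)
    return h_mse * n ** float(g_mse - g_rbc)


print(rate_exponents(1))                  # interior, local linear
print(rate_exponents(1, boundary=True))   # boundary, local linear
print(rescale_mse_bandwidth(0.3, 500, 1, boundary=True))
```

For local linear regression at an interior point the two exponents coincide at $1/5$, so no
rescaling is applied; at the boundary they are $1/5$ and $1/4$, so the rescaled bandwidth is
smaller.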

4 Simulation Results 

We now report a representative sample of results from a simulation study to illustrate our
findings. We drew 5,000 replicated data sets, each being $n = 500$ i.i.d. draws from the
model $Y_i = m(X_i) + \varepsilon_i$, with $m(x) = \sin(3\pi x/2)(1 + 18x^2[\mathrm{sgn}(x) + 1])^{-1}$,
$X_i \sim U[-1, 1]$, and $\varepsilon_i \sim N(0, 1)$. We consider inference at the five points
$x \in \{-2/3, -1/3, 0, 1/3, 2/3\}$. The function $m(x)$ and the five evaluation points are
plotted in Figure 1; this function was previously used by Berry et al. (2002) and Hall and
Horowitz (2013). The supplement gives results for other models, bandwidth selectors and their
simulation distributions, alternative variance estimators, and more detailed studies of coverage
and length.
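
For readers who wish to reproduce the design, one replicated data set can be drawn as in the
following sketch. It assumes the covariate is uniform on $[-1, 1]$, which the negative evaluation
points require; the function names and the seeding convention are choices made here.

```python
import math
import random


def m(x):
    """Regression function sin(3*pi*x/2) / (1 + 18 x^2 [sgn(x) + 1])."""
    sgn = (x > 0) - (x < 0)
    return math.sin(3 * math.pi * x / 2) / (1 + 18 * x ** 2 * (sgn + 1))


def draw_sample(n=500, rng=None):
    """Draw (X_i, Y_i), i = 1..n, with X ~ U[-1, 1] and standard normal errors."""
    rng = rng or random.Random()
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [m(x) + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


xs, ys = draw_sample(500, random.Random(0))
for x0 in (-2 / 3, -1 / 3, 0.0, 1 / 3, 2 / 3):
    print(round(x0, 3), round(m(x0), 3))
```

Repeating `draw_sample` 5,000 times and recording whether each interval covers `m(x0)` gives the
empirical coverage figures for a chosen inference method.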

We compared robust bias correction to undersmoothing, traditional bias correction, the
off-the-shelf R package locfit (Loader, 2013), and the procedure of Hall and Horowitz (2013). In
all cases the point estimator is based on local linear regression with the data-driven bandwidth
$\hat h^{int}_{dpi}$, which shares the rate of $h_{mse}$ in this case, and $\rho = 1$. The locfit
package has a bandwidth selector, but it was ill-behaved and often gave zero empirical coverage.
Hall and Horowitz (2013) do not give an explicit optimal bandwidth, but do advocate a feasible
$h_{mse}$, following Ruppert et al. (1995). To implement their method, we used 500 bootstrap
replications and we set $1 - \xi = 0.9$ over a sequence
$\{x_1, \ldots, x_N\} = \{-0.9, -0.8, \ldots, 0, \ldots, 0.8, 0.9\}$ to obtain the final quantile
$\hat\alpha_\xi(\alpha_0)$, and used their proposed standard errors
$\hat\sigma^2_{HH} = \kappa \hat\sigma^2 / \hat f_X$, where
$\hat\sigma^2 = \sum_{i=1}^{n} \hat\varepsilon_i^2 / n$ for
$\hat\varepsilon_i = \tilde\varepsilon_i - \bar\varepsilon$, with
$\tilde\varepsilon_i = Y_i - \hat m(X_i)$ and $\bar\varepsilon = \sum_{i=1}^{n} \tilde\varepsilon_i / n$.

Table 1 shows empirical coverage and average length at all five points for all five methods.
Robust bias correction yields accurate coverage throughout the support; performance of the
other methods varies. For $x = -2/3$, the regression function is nearly linear, leaving almost
no bias, and the other methods work quite well. In contrast, at $x = -1/3$ and $x = 0$, all
methods except robust bias correction suffer from coverage distortions due to bias. Indeed,
Hall and Horowitz (2013, p. 1893) report that “[t]he ‘exceptional’ $100\xi\%$ of points that are
not covered are typically close to the locations of peaks and troughs, [which] cause difficulties
because of bias.” Finally, bias is still present, though less of a problem, for $x = 1/3$ and
$x = 2/3$, and coverage of the competing procedures improves somewhat. Motivated by the
fact that the data-driven bandwidth selectors may be “too large” for proper undersmoothing,
we studied the common practice of ad-hoc undersmoothing of the MSE-optimal bandwidth
choice $h_{mse}$: the results in Table S.II.8 of the supplement show this to be no panacea.

To illustrate our findings further, Figures 2(a) and 2(b) compare coverage and length
of different inference methods over a range of bandwidths. Robust bias correction delivers
accurate coverage for a wide range of bandwidths, including larger choices, and thus can
yield shorter intervals. For undersmoothing, coverage accuracy requires a delicate choice of
bandwidth, and for correct coverage, a longer interval. Figure 2(c), in color online, reinforces
this point by showing the “average position” of $I_{us}(h)$ and $I_{rbc}(h)$ for a range of
bandwidths: each bar is centered at the average bias and is of average length, and then
color-coded by coverage (green indicates good coverage, fading to red as coverage deteriorates).
These results show that when $I_{us}$ is short, bias is large and coverage is poor. In contrast,
$I_{rbc}$ has good coverage at larger bandwidths and thus shorter length.

5 Conclusion 

This paper has made three distinct, but related, points regarding nonparametric inference.
First, we showed that bias correction, when coupled with a new standard error formula, performs
as well as or better than undersmoothing for confidence interval coverage and length.
Further, such intervals are more robust to bandwidth choice in applications. Second, we
showed theoretically when the popular empirical practice of using MSE-optimal bandwidths
is justified, and more importantly, when it is not, and we gave concrete implementation
recommendations for applications. Third, we proved that confidence intervals based on local
polynomials do have automatic boundary carpentry, provided proper Studentization is used.
These results are tied together through the themes of higher order expansions, the importance
of finite sample variance calculations, and the key, common message that inference
procedures must account for additional variability introduced by bias correction.
Table 1: Empirical Coverage and Average Interval Length of 95% Confidence Intervals

Evaluation  Average    Empirical Coverage                      Interval Length
Point       Bandwidth  US      Locfit  BC      HH      RBC     US      Locfit  HH      RBC
-2/3        0.166      94.8    94.4    81.8    93.5    93.7    0.505   0.544   0.479   0.722
-1/3        0.283      56.5    70.7    80.6    48.2    92.8    0.380   0.409   0.316   0.540
0           0.318      74.4    83.7    80.3    61.1    92.6    0.354   0.383   0.279   0.507
1/3         0.370      89.9    92.1    78.5    78.4    92.9    0.327   0.356   0.241   0.470
2/3         0.265      93.9    93.9    81.3    88.4    93.6    0.391   0.425   0.339   0.562

Notes: (i) Column “Average Bandwidth” reports simulation average of estimated bandwidths
$\hat h = \hat h_{dpi} = \hat h^{int}_{dpi}$. Simulation distributions for estimated bandwidths are
reported in the supplement. (ii) US = Undersmoothing, Locfit = R package locfit by Loader (2013),
BC = Bias Corrected, HH = Hall and Horowitz (2013), RBC = Robust Bias Corrected.


Figure 1: True Regression Model and Evaluation Points

[Figure 2 panel title: Empirical Coverage]

Figure 2: Local Polynomial Simulation Results for x = 0
Supplement to “On the Effect of Bias Estimation on
Coverage Accuracy in Nonparametric Inference”


This supplement contains technical and notational details omitted from the main text,
proofs of all results, further technical details and derivations, and additional simulation
results and numerical analyses. The main results are Edgeworth expansions of the distribution
functions of the t-statistics $T_{us}$, $T_{bc}$, and $T_{rbc}$, for density estimation and local
polynomial regression. Stating and proving these results is the central purpose of this
supplement. The higher-order expansions of confidence interval coverage probabilities in the main
paper follow immediately by evaluating the Edgeworth expansions at the interval endpoints.

Part S.I contains all material for density estimation at interior points, while Part S.II 
treats local polynomial regression at both interior and boundary points, as in the main text. 
Roughly, these have the same generic outline: 

• We first present all notation, both for the estimators themselves and the Edgeworth 
expansions, regardless of when the notation is used, as a collective reference; 

• We then discuss optimal bandwidths and other practical matters, expanding on details 
of the main text; 

• Assumptions for validity of the Edgeworth expansions are restated from the main text,
and Cramér’s condition is discussed;

• Bias properties are discussed in more detail than in the main text, and some things 
mentioned there are made precise; 

• The main Edgeworth expansions are stated, some corollaries are given, and the proofs 
are given; 

• Complete simulation results are presented. 


All our methods are implemented in software available from the authors’ websites and via 
the R package nprobust available at https ://cran.r-project. org/package=nprobust. 
Part S.I

Kernel Density Estimation and Inference


S.I.1 Notation

Here we collect notation to be used throughout this section, even if it is restated later.
Throughout this supplement, let $X_{h,i} = (x - X_i)/h$ and similarly for $X_{b,i}$. The evaluation
point is implicit here. In the course of proofs we will frequently write $s = \sqrt{nh}$.

S.I.1.1 Estimators, Variances, and Studentized Statistics

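To begin, recall that the original and bias-corrected density estimators are

\[ \hat f(x) = \frac{1}{nh}\sum_{i=1}^n K(X_{h,i}) \]

and

\[ \hat f - \hat B_f = \frac{1}{nh}\sum_{i=1}^n M(X_{h,i}), \qquad M(u) := K(u) - \rho^{1+k} L^{(k)}(\rho u)\,\mu_{K,k}, \tag{9} \]

for symmetric kernel functions $K(\cdot)$ and $L(\cdot)$ that integrate to one on their compact support,
$h$ and $b$ are bandwidth sequences that vanish as $n \to \infty$, and where

\[ \rho = h/b, \qquad \hat B_f = h^k \hat f^{(k)}(x)\,\mu_{K,k}, \qquad \hat f^{(k)}(x) = \frac{1}{n b^{1+k}}\sum_{i=1}^n L^{(k)}(X_{b,i}), \]

and integrals of the kernel are denoted

\[ \mu_{K,k} = \frac{(-1)^k}{k!}\int u^k K(u)\,du, \qquad \text{and} \qquad \vartheta_{K,m} = \int K(u)^m\,du. \]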

The three statistics $T_{us}$, $T_{bc}$, and $T_{rbc}$ share a common structure that is exploited to give
a unified theorem statement and proof. For $v \in \{1,2\}$, define

\[ \hat f_v = \frac{1}{nh}\sum_{i=1}^n N_v(X_{h,i}), \qquad \text{where } N_1(u) = K(u) \text{ and } N_2(u) = M(u), \]

and $M$ is given in Eqn. (9). Thus, $\hat f_1 = \hat f$ and $\hat f_2 = \hat f - \hat B_f$. In exactly the same way, define

\[ \sigma_v^2 := nh\,V[\hat f_v] = \frac{1}{h}\left\{ E\left[N_v(X_{h,i})^2\right] - E\left[N_v(X_{h,i})\right]^2 \right\} \]

and the estimator

\[ \hat\sigma_v^2 = \frac{1}{h}\left\{ \frac{1}{n}\sum_{i=1}^n N_v(X_{h,i})^2 - \left[\frac{1}{n}\sum_{i=1}^n N_v(X_{h,i})\right]^2 \right\}. \]

The statistic of interest for the generic Edgeworth expansion is, for $1 \le w \le v \le 2$,

\[ T_{v,w} := \frac{\sqrt{nh}\,(\hat f_v - f)}{\hat\sigma_w}. \]

In this notation, $T_{us} = T_{1,1}$, $T_{bc} = T_{2,1}$, and $T_{rbc} = T_{2,2}$.
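
As a rough illustration of how the generic statistic translates into an interval, the following Python sketch computes $\hat f_v$ and $\hat\sigma_v^2$ for a given kernel $N_v$ and inverts $|T_{2,2}| \le z$ to obtain a robust bias-corrected interval. It assumes $k = \ell = 2$ and $\rho = 1$, with $K$ the Epanechnikov kernel and $L^{(2)}$ the second-order MSE-optimal derivative kernel quoted in Section S.I.2.3. The function names, the bandwidth, and the simulated sample are hypothetical; this is not the nprobust implementation.

```python
import math
import random

def K(u):
    """Epanechnikov kernel (second order), support [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def L2(u):
    """Derivative kernel L^(2) paired with K (an assumed choice for this sketch)."""
    return (105.0 / 16.0) * (6.0 * u**2 - 5.0 * u**4 - 1.0) if abs(u) <= 1.0 else 0.0

MU_K2 = 0.1  # mu_{K,2} = (1/2) * int u^2 K(u) du for the Epanechnikov kernel

def M(u, rho=1.0):
    """Induced kernel M(u) = K(u) - rho^(1+k) L^(k)(rho u) mu_{K,k}, here with k = 2."""
    return K(u) - rho**3 * L2(rho * u) * MU_K2

def estimate(data, x, h, kernel):
    """Return (f_hat_v, sigma_hat_v^2) for the kernel N_v = kernel."""
    n = len(data)
    vals = [kernel((x - xi) / h) for xi in data]
    m1 = sum(vals) / n
    m2 = sum(v * v for v in vals) / n
    return m1 / h, (m2 - m1 * m1) / h

def rbc_interval(data, x, h, z=1.959964):
    """Invert |T_{2,2}| <= z: f_hat_2 +/- z * sigma_hat_2 / sqrt(n h)."""
    f2, s2 = estimate(data, x, h, M)
    half = z * math.sqrt(s2 / (len(data) * h))
    return f2 - half, f2 + half

# Hypothetical data: a standard Normal sample, evaluated at x = 0.
random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(2000)]
lo, hi = rbc_interval(sample, x=0.0, h=0.8)
```

Passing `K` instead of `M` to `estimate` gives the ingredients of $T_{us}$; mixing $\hat f_2$ with $\hat\sigma_1$ would reproduce the mismatch present in $T_{bc}$.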


S.I.1.2 Edgeworth Expansion Terms

The scaled bias is $\eta_v = \sqrt{nh}\,(E[\hat f_v] - f)$. The Standard Normal distribution and density
functions are $\Phi(z)$ and $\phi(z)$, respectively.

The Edgeworth expansion for the distribution of $T_{v,w}$ will consist of polynomials with
coefficients that depend on moments of the kernel(s). To this end, continuing with the generic
notation, for nonnegative integers $j, k, p$, define

\[ \gamma_{v,p} = h^{-1} E\left[N_v(X_{h,i})^p\right], \qquad \Delta_{v,j} = \frac{1}{s}\sum_{i=1}^n \left\{ N_v(X_{h,i})^j - E\left[N_v(X_{h,i})^j\right] \right\}, \]

and

\[ \nu_{v,w}(j,k,p) = \frac{1}{h} E\left[ \left(N_v(X_{h,i}) - E[N_v(X_{h,i})]\right)^j \left(N_w(X_{h,i})^p - E[N_w(X_{h,i})^p]\right)^k \right]. \]

We abbreviate $\nu_{v,w}(j,0,p) = \nu_v(j)$.

To expand the distribution function, additional polynomials are needed beyond those used
in the main text for coverage error. These are

\[ p^{(1)}_{v,w}(z) = \phi(z)\,\sigma_w^{-3}\left[\nu_{v,w}(1,1,2)\,z^2/2 - \nu_v(3)(z^2 - 1)/6\right], \]

\[ p^{(2)}_{v,w}(z) = -\phi(z)\,\sigma_w^{-3} E[\hat f_w]\,\nu_{v,w}(1,1,1)\,z^2, \qquad \text{and} \qquad p^{(3)}_{v,w}(z) = \phi(z)\,\sigma_w^{-1}. \]
Next, recall from the main text the polynomials used in coverage error expansions, here with
an explicit argument for a generic quantile $z$ rather than the specific $z_{\alpha/2}$:

\[ q_1(z;K) = \vartheta_{K,2}^{-2}\vartheta_{K,4}(z^3 - 3z)/6 - \vartheta_{K,2}^{-3}\vartheta_{K,3}^{2}\left[2z^3/3 + (z^5 - 10z^3 + 15z)/9\right], \]

\[ q_2(z;K) = -\vartheta_{K,2}^{-1}(z), \qquad \text{and} \qquad q_3(z;K) = \vartheta_{K,2}^{-2}\vartheta_{K,3}(2z^3/3). \]

The corresponding polynomials for expansions of the distribution function are

\[ q^{(k)}_{v,w}(z) = \frac{1}{2}\frac{\phi(z)}{f}\,q_k(z;N_w), \qquad k = 1,2,3. \]

Finally, the precise forms of $\Omega_1$ and $\Omega_2$ are:

\[ \Omega_1 = -2\,\frac{\mu_{K,k}}{\nu_1(2)}\left\{ \int f(x - uh)K(u)L^{(k)}(u\rho)\,du - b\int f(x - uh)K(u)\,du \int f(x - ub)L^{(k)}(u)\,du \right\} \]

and $\Omega_2 = \mu_{K,k}^2\,\vartheta_{K,2}^{-1}\,\vartheta_{L^{(k)},2}$. These only appear for $T_{bc}$, and so are not indexed by $\{v,w\}$.

All these are discussed in Section S.I.6.


S.I.2 Details of practical implementation 

We maintain $\ell = 2$ and recommend $k = 2$. For the kernels $K$ and $L$, we recommend either
the second order minimum variance (to minimize interval length) or the MSE-optimal kernels;
see Sections S.I.2.3 and S.I.4.2. In the next two subsections we discuss choice of $h$ and $\rho$.

As argued below in Section S.I.2.3, we shall maintain $\rho = 1$. In the main text we give a
direct plug-in (DPI) rule to implement the coverage-error optimal bandwidth. Here we
give complete details for this procedure as well as document a second practical choice, based
on a rule-of-thumb (ROT) strategy. Both choices yield the optimal coverage error decay rate
of $n^{-(k+2)/(1+(k+2))}$.


Remark 1 (Undercoverage of $I_{us}(h^*_{mse})$). It is possible not only to show that $I_{us}(h^*_{mse})$
asymptotically undercovers (see Hall and Horowitz (2013) for discussion in the regression context)
but also to quantify precisely the coverage. To do so, write
$T_{us} = \sqrt{nh}\,(\hat f - E[\hat f])/\hat\sigma_{us} + \eta_{us}/\hat\sigma_{us}$,
where the first term will be asymptotically standard Normal and the second will be a nonvanishing
bias. To characterize the bias, recall from Eqn. (10) and Section S.I.1 that
$\eta_{us} = \sqrt{nh}\,h^k[\mu_{K,k} f^{(k)} + o(1)]$ and $\hat\sigma^2_{us} = \vartheta_{K,2} f\,[1 + o_P(1)]$. Therefore, plugging in
$(h^*_{mse})^{1+2k} = \vartheta_{K,2} f/(\mu_{K,k} f^{(k)})^2/n$ shows that $\eta_{us}/\hat\sigma_{us} = 1 + o_P(1)$, whence
$T_{us}(h^*_{mse}) \to_d N(1,1)$. For example, if $\alpha = 0.05$, $P[f \in I_{us}(h^*_{mse})] \approx 0.83$. ■
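
The limiting coverage quoted in the remark can be checked numerically. The short Python sketch below evaluates $P(|Z + 1| \le z_{\alpha/2})$ for $Z \sim N(0,1)$ using only the standard library; the helper names are ours and purely illustrative.

```python
import math

def std_normal_cdf(z):
    """Standard Normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def coverage_with_shift(z_alpha2, shift):
    """P(|Z + shift| <= z_alpha2) for Z ~ N(0, 1)."""
    return std_normal_cdf(z_alpha2 - shift) - std_normal_cdf(-z_alpha2 - shift)

# Nominal 95% interval when the t-statistic is centered at 1 rather than 0.
cov = coverage_with_shift(1.959964, 1.0)
```

With `shift=0.0` the same function returns the nominal level, which gives a quick sanity check on the calculation.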

S.I.2.1 Bandwidth Choice: Rule-of-Thumb (ROT) 

Motivated by the fact that estimating $\hat H_{dpi}$ might be difficult in practice, while data-driven
MSE-optimal bandwidth selectors are readily available, the ROT bandwidth choice is to simply
rescale any feasible MSE-optimal bandwidth $\hat h_{mse}$ to yield optimal coverage error decay
rates (but sub-optimal constants):

\[ \hat h_{rot} = \hat h_{mse}\; n^{-(k-2)/((1+2k)(k+3))}. \]

When $k = 2$, $\hat h_{rot} = \hat h_{mse}$, which is optimal (in rates) as discussed previously.
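
The rescaling simply converts the MSE-optimal rate $n^{-1/(1+2k)}$ into the coverage-error optimal rate $n^{-1/(k+3)}$. A minimal Python sketch of this adjustment follows; the function name and arguments are hypothetical, and the input bandwidth is assumed to come from any feasible MSE-optimal selector.

```python
def rot_bandwidth(h_mse, n, k=2):
    """Rescale an MSE-optimal bandwidth (rate n^(-1/(1+2k))) to rate n^(-1/(k+3)).

    h_mse : feasible MSE-optimal bandwidth from any selector (assumed given)
    n     : sample size
    k     : order of the kernel K (an even integer, k >= 2)
    """
    exponent = -(k - 2) / ((1 + 2 * k) * (k + 3))
    return h_mse * n ** exponent
```

For $k = 2$ the exponent is zero and the input is returned unchanged; for higher-order kernels the factor is smaller than one, so the bandwidth shrinks.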

Remark 2 (Integrated Coverage Error). A closer analogue of the Silverman (1986) rule of
thumb, which uses the integrated MSE, would be to integrate the coverage error over the point
of evaluation $x$. For point estimation, this approach has some practical benefits. However, in
the present setting note that $\int f^{(k)}(x)\,dx = 0$, removing the third term (of order $h^k$) entirely
and thus, for any given point $x$, yields a lower quality approximation. ■


S.I.2.2 Bandwidth Choice: Direct Plug-In (DPI) 

To detail the direct plug-in (DPI) rule from the main text, it is useful to first simplify the
problem. Recall from the main text that the optimal choice is $h^*_{rbc} = H^*_{rbc}(\rho)\,n^{-1/(k+3)}$, where

\[
H^*_{rbc}(K,L,\rho) = \arg\min_H \Big| H^{-1} q_1(M_\rho)
 + H^{1+2(k+2)} \big(f^{(k+2)}\big)^2 \big(\mu_{K,k+2} + \rho^{-2}\mu_{K,k}\mu_{L,2}\big)^2 q_2(M_\rho)
 + H^{k+2} f^{(k+2)} \big(\mu_{K,k+2} + \rho^{-2}\mu_{K,k}\mu_{L,2}\big)\, q_3(M_\rho) \Big|.
\]

With $\ell = 2$ and $\rho = 1$, and using the definitions of $q_k(M_1)$, $k = 1,2,3$, from the main text or
Section S.I.1.2, this simplifies (after dividing the objective by the positive factor
$z\,\vartheta_{M,2}^{-2}$, which leaves the minimizer unchanged) to:

\[
H^*_{rbc}(K,L,1) = \arg\min_H \Big| H^{-1}\Big\{ \vartheta_{M,4}\,\frac{z^2 - 3}{6} - \frac{\vartheta_{M,3}^2}{\vartheta_{M,2}}\,\frac{z^4 - 4z^2 + 15}{9} \Big\}
 - H^{1+2(k+2)} \big(f^{(k+2)}\big)^2 \big(\mu_{K,k+2} + \mu_{K,k}\mu_{L,2}\big)^2 \vartheta_{M,2}
 + H^{k+2} f^{(k+2)} \big(\mu_{K,k+2} + \mu_{K,k}\mu_{L,2}\big)\, \vartheta_{M,3}\,\frac{2z^2}{3} \Big|,
\]

where $z = z_{\alpha/2}$ is the appropriate upper quantile of the Normal distribution. However, $H^*_{rbc}(\rho)$
still depends on the unknown density through $f^{(k+2)}$.

Our recommendation is a DPI rule of order one, which uses a pilot bandwidth to estimate
$f^{(k+2)}$ consistently. A simple and easy to implement choice is the MSE-optimal bandwidth
appropriate to estimating $f^{(k+2)}$, say $h^*_{k+2,mse}$, which is different from $h^*_{mse}$ for the level of
the function; see e.g., Wand and Jones (1995). Let us denote a feasible MSE-optimal pilot
bandwidth by $\hat h_{k+2,mse}$. Then we have:


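\[
\hat H_{dpi}(K,L,1) = \arg\min_H \Big| H^{-1}\Big\{ \vartheta_{M,4}\,\frac{z^2 - 3}{6} - \frac{\vartheta_{M,3}^2}{\vartheta_{M,2}}\,\frac{z^4 - 4z^2 + 15}{9} \Big\}
 - H^{1+2(k+2)}\, \hat f^{(k+2)}(x; \hat h_{k+2,mse})^2 \big(\mu_{K,k+2} + \mu_{K,k}\mu_{L,2}\big)^2 \vartheta_{M,2}
 + H^{k+2}\, \hat f^{(k+2)}(x; \hat h_{k+2,mse}) \big(\mu_{K,k+2} + \mu_{K,k}\mu_{L,2}\big)\, \vartheta_{M,3}\,\frac{2z^2}{3} \Big|.
\]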

This is now easily solved numerically (see note below). Further, if $k = 2$, the most common
case in practice, and $K$ and $L$ are either the respective second order minimum variance or
MSE-optimal kernels (Sections S.I.2.3 and S.I.4.2), then the above may be simplified to:

\[
\hat H_{dpi}(M,1) = \arg\min_H \Big| H^{-1}\Big\{ \vartheta_{M,4}\,\frac{z^2 - 3}{6} - \frac{\vartheta_{M,3}^2}{\vartheta_{M,2}}\,\frac{z^4 - 4z^2 + 15}{9} \Big\}
 - H^{9}\, \hat f^{(4)}(x; \hat h_{k+2,mse})^2\, \mu_{M,4}^2\, \vartheta_{M,2}
 + H^{4}\, \hat f^{(4)}(x; \hat h_{k+2,mse})\, \mu_{M,4}\, \vartheta_{M,3}\,\frac{2z^2}{3} \Big|.
\]


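Continuing with $k = 2$, a second option is a DPI rule of order zero, which uses a reference
model to build the rule of thumb, more akin to Silverman (1986). Using the Normal
distribution, so that $f(x) = \phi(x)$ and derivatives have known form, we obtain the bandwidth
constant as:

\[
\arg\min_H \Big| H^{-1}\Big\{ \vartheta_{M,4}\,\frac{z^2 - 3}{6} - \frac{\vartheta_{M,3}^2}{\vartheta_{M,2}}\,\frac{z^4 - 4z^2 + 15}{9} \Big\}
 - H^{9} \big[(\tilde x^4 - 6\tilde x^2 + 3)\,\phi(\tilde x)\big]^2 \mu_{M,4}^2\, \vartheta_{M,2}
 + H^{4} (\tilde x^4 - 6\tilde x^2 + 3)\,\phi(\tilde x)\, \mu_{M,4}\, \vartheta_{M,3}\,\frac{2z^2}{3} \Big|,
\]

where $\tilde x = (x - \mu)/\sigma_X$ is the point of interest centered and scaled.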


Remark 3 (Notes on computation). When numerically solving the above minimization problems,
computation will be greatly sped up by squaring the objective function. ■
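
To make the computational remark concrete, here is a minimal Python sketch of the numerical step for an objective of the $k = 2$ form $|H^{-1}A - H^{9}B + H^{4}C|$. The coefficients `A`, `B`, `C` are hypothetical placeholders standing in for the kernel, quantile, and density-derivative constants described above; the search interval and the grid-plus-golden-section routine are our own choices, not the method used in the nprobust package.

```python
import math

def objective(H, A, B, C):
    """Objective of the k = 2 form: H^{-1} A - H^9 B + H^4 C (A, B, C assumed given)."""
    return A / H - B * H**9 + C * H**4

def minimize_squared(A, B, C, lo=1e-2, hi=1e1, grid=3001, refine=60):
    """Minimize objective(H)^2 on [lo, hi]: coarse log-grid, then golden-section search."""
    sq = lambda H: objective(H, A, B, C) ** 2
    pts = [lo * (hi / lo) ** (i / (grid - 1)) for i in range(grid)]
    best = min(range(grid), key=lambda i: sq(pts[i]))
    a = pts[max(best - 1, 0)]           # bracket around the best grid point
    b = pts[min(best + 1, grid - 1)]
    g = (math.sqrt(5.0) - 1.0) / 2.0    # golden ratio conjugate
    c, d = b - g * (b - a), a + g * (b - a)
    for _ in range(refine):
        if sq(c) < sq(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return 0.5 * (a + b)

# Toy coefficients (hypothetical): 1/H - H^9 vanishes at H = 1.
H_toy = minimize_squared(A=1.0, B=1.0, C=0.0)
```

Squaring keeps the same minimizer as the absolute value while removing the kink at a sign change, which is why a smooth one-dimensional search behaves well here.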


S.I.2.3 Choice of $\rho$

First, we expand on the argument that $\rho$ should be bounded and positive. Intuitively, the
standard errors $\hat\sigma^2_{rbc}$ control variance up to order $(nh)^{-1}$, while letting $b \to 0$ faster removes
more bias. If $b$ vanishes too fast, the variance is no longer controlled. Setting $\rho \in (0,\infty)$
balances these two. Let us simplify the discussion by taking $\ell = 2$, reflecting the widespread
use of symmetric kernels. This does not affect the conclusions in any conceptual way, but
considerably simplifies the notation. With this choice, Eqn. (9) yields the tidy expression

\[ \eta_{bc} = \sqrt{nh}\,h^{k+2} f^{(k+2)}\big(\mu_{K,k+2} - \rho^{-2}\mu_{K,k}\mu_{L,2}\big)\{1 + o(1)\}. \]

Choice of $\ell$ and $b$ (or $\rho$) cannot reduce the first term, which represents $E[\hat f] - f - B_f$, and
further, if $\rho \to \infty$, the bias rate is not improved, but the variance is inflated beyond order
$(nh)^{-1}$. On the other hand, if $\rho \to 0$, then not only is a delicate choice of $b$ needed, but
$\ell > 2$ is required, else the second term above dominates $\eta_{bc}$, and the full power of the variance
correction is not exploited; that is, more bias may be removed without inflating the variance
rate. Hall (1992b, p. 682) remarked that if $E[\hat f] - f - B_f$ is (part of) the leading bias term, then
“explicit bias correction [...] is even less attractive relative to undersmoothing.” We show
that, on the contrary, when using our proposed Studentization, it is optimal that $E[\hat f] - f - B_f$
is (part of) the dominant bias term. This reasoning is not an artifact of choosing $k$ even and
$\ell = 2$, but in other cases $\rho \to 0$ can be optimal if the convergence is sufficiently slow to
equalize the two bias terms.

The following result makes the above intuition precise.

Corollary 6 (Robust bias correction: $\rho \to 0$). Let the conditions of Theorem 3(c) hold, with
$\rho \to 0$, and fix $\ell = 2$ and $k \le S - 2$. Then

\[
P[f \in I_{rbc}] = 1 - \alpha + \Big\{ \frac{1}{nh}\,q_1(K)
 + nh^{1+2(k+2)}\big(f^{(k+2)}\big)^2\big(\mu_{K,k+2} + \rho^{-2}\mu_{K,k}\mu_{L,2}\big)^2 q_2(K)
 + h^{k+2} f^{(k+2)}\big(\mu_{K,k+2} + \rho^{-2}\mu_{K,k}\mu_{L,2}\big)\,q_3(K) \Big\}\,\frac{\phi(z_{\alpha/2})}{f}\,\{1 + o(1)\}.
\]

By virtue of our new Studentization, the leading variance remains order $(nh)^{-1}$ and the
problematic correlation terms are absent. However, by forcing $\rho \to 0$, the $\rho^{-2}$ terms of $\eta_{bc}$ are
dominant (the bias of $\hat B_f$), and in light of our results, unnecessarily inflated. This verifies that
$\rho = 0$ or $\infty$ will be suboptimal.



We thus restrict to bounded and positive $\rho$. Therefore, $\rho$ impacts only the shape of
the “kernel” $M_\rho(u) = K(u) - \rho^{1+k} L^{(k)}(\rho u)\,\mu_{K,k}$, and hence the choice of $\rho$ depends on what
properties the user desires for the kernel. It happens that $\rho = 1$ has good theoretical properties
and performs very well numerically (see Section S.I.8). As a result, from the practitioner’s
point of view, choice of $\rho$ (or $b$) is completely automatic.

To see the optimality of $\rho = 1$, consider two cogent and well-studied possibilities: finding
the kernel shape to minimize (i) interval length and (ii) MSE. The following optimal shapes
are derived by Gasser et al. (1985) and references therein. Given the above results, we set
$k = 2$. Indeed, the optimality properties here do not extend to higher order kernels.

Minimizing interval length is (asymptotically) equivalent to finding the minimum variance
fourth-order kernel, as $\sigma^2_{rbc} \to f\,\vartheta_{M_\rho,2}$. Perhaps surprisingly, choosing $K$ and $L^{(2)}$ to be the
second-order minimum variance kernels for estimating $f$ and $f^{(2)}$ respectively, yields an $M_1(u)$
that is exactly the minimum variance kernel. The fourth order minimum variance kernel for
estimating $f$ is $K_{mv}(u) = (3/8)(-5u^2 + 3)$, which is identical to $M_1(u)$ when $K$ is the uniform
kernel and $L^{(2)}(u) = (15/4)(3u^2 - 1)$, the minimum variance kernels for $f$ and $f^{(2)}$ respectively.

The result is similar for minimizing MSE: choosing $K$ and $L^{(2)}$ to be the MSE-optimal
kernels for their respective point estimation problems yields an MSE-optimal $M_1(u)$. The
optimal fourth order kernel is $K_{mse}(u) = (15/32)(7u^4 - 10u^2 + 3)$, and the respective second-order
MSE optimal kernels are $K(u) = (3/4)(1 - u^2)$ and $L^{(2)}(u) = (105/16)(6u^2 - 5u^4 - 1)$. A
practitioner might use the MSE-optimal kernels (along with $h^*_{mse}$) to obtain the best possible
point estimate. Our results then give an accompanying measure of uncertainty that both has
correct coverage and the attractive feature of using the same effective sample.
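
Both identities can be verified by exact polynomial arithmetic on $[-1,1]$. The Python sketch below represents each kernel by its coefficient list and checks that $M_1(u) = K(u) - L^{(2)}(u)\,\mu_{K,2}$ reproduces the two fourth-order kernels quoted above; the helper names are ours and purely illustrative.

```python
from fractions import Fraction as F

def second_moment_constant(K):
    """mu_{K,2} = (1/2) * int_{-1}^{1} u^2 K(u) du for a polynomial kernel.

    K is a coefficient list, with K[j] multiplying u^j.
    """
    total = F(0)
    for j, c in enumerate(K):
        p = j + 2
        if p % 2 == 0:  # odd powers integrate to zero on [-1, 1]
            total += c * F(2, p + 1)
    return total / 2

def induced_kernel(K, L2):
    """Coefficients of M_1(u) = K(u) - L^(2)(u) * mu_{K,2}  (rho = 1, k = 2)."""
    mu = second_moment_constant(K)
    n = max(len(K), len(L2))
    K = K + [F(0)] * (n - len(K))
    L2 = L2 + [F(0)] * (n - len(L2))
    return [a - mu * b for a, b in zip(K, L2)]

# Minimum variance case: uniform K and L^(2)(u) = (15/4)(3u^2 - 1).
uniform = [F(1, 2)]
L2_mv = [F(15, 4) * c for c in (F(-1), F(0), F(3))]
K_mv = [F(3, 8) * c for c in (F(3), F(0), F(-5))]

# MSE-optimal case: Epanechnikov K and L^(2)(u) = (105/16)(6u^2 - 5u^4 - 1).
epan = [F(3, 4), F(0), F(-3, 4)]
L2_mse = [F(105, 16) * c for c in (F(-1), F(0), F(6), F(0), F(-5))]
K_mse = [F(15, 32) * c for c in (F(3), F(0), F(-10), F(0), F(7))]

matches_mv = induced_kernel(uniform, L2_mv) == K_mv
matches_mse = induced_kernel(epan, L2_mse) == K_mse
```

Using `Fraction` avoids floating-point tolerance questions, so the two comparisons are exact equalities of coefficients.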

In Section S.I.4.2 we numerically compare several kernel shapes, focusing on: (i) interval
length, measured by $\vartheta_{M_\rho,2}$, (ii) bias, given by $\tilde\mu_{M_\rho,4}$, and (iii) the associated MSE, given by
$[\vartheta_{M_\rho,2}^{8}\,\tilde\mu_{M_\rho,4}^{2}]^{1/9}$. These results, and the discussion above, give the foundations for our
recommendation of $\rho = 1$, which delivers an easy-to-implement, fully automatic choice for implementing
robust bias-correction that performs well numerically, as in Section S.I.8.

Remark 4 (Coverage Error Optimal Kernels). Our results hint at a third notion of optimal
kernel shape: minimizing coverage error. This kernel, for a fixed order $k$, would minimize the
constants in Corollary 1 of the main text. In that result, $h$ is chosen to optimize the rate and
the constant $H^*_{us}$ gives the minimum for a fixed kernel $K$. A step further would be to view
$H^*_{us}$ as a function of $K$, and optimize over it. To our knowledge, such a derivation has not been
done and may be of interest. ■
S.I.3 Assumptions


Copied directly from the main text (see discussion there), the following assumptions are
sufficient for our results.


Assumption S.I.3.1 (Data-generating process). $\{X_1, \ldots, X_n\}$ is a random sample with an
absolutely continuous distribution with Lebesgue density $f$. In a neighborhood of $x$, $f > 0$, $f$
is $S$-times continuously differentiable with bounded derivatives $f^{(s)}$, $s = 1, 2, \ldots, S$, and $f^{(S)}$
is Hölder continuous with exponent $\varsigma$.

Assumption S.I.3.2 (Kernels). The kernels $K$ and $L$ are bounded, even functions with support
$[-1,1]$, and are of order $k \ge 2$ and $\ell \ge 2$, respectively, where $k$ and $\ell$ are even integers.
That is, $\mu_{K,0} = 1$, $\mu_{K,j} = 0$ for $1 \le j < k$, and $\mu_{K,k} \ne 0$ and bounded, and similarly for $\mu_{L,j}$
with $\ell$ in place of $k$. Further, $L$ is $k$-times continuously differentiable. For all integers $j$ and
$m$ such that $j + m = k - 1$, $f^{(j)}(x_0)L^{(m)}((x_0 - x)/b) = 0$ for $x_0$ in the boundary of the support.

It will cause no confusion (as the notations never occur in the same place), but in the
course of proofs we will frequently write $s = \sqrt{nh}$.

Assumption S.I.3.3 (Cramér’s Condition). For each $\xi > 0$ and all sufficiently small $h$,

\[ \sup_{t \in \mathbb{R}^2,\; t_1^2 + t_2^2 > \xi} \left| \int \exp\{ i (t_1 M(u) + t_2 M(u)^2) \}\, f(x - uh)\,du \right| \le 1 - C(x,\xi)\,h, \]

where $C(x,\xi) > 0$ is a fixed constant and $i = \sqrt{-1}$.

Remark 5 (Sufficient Conditions for Cramér’s Condition). Assumption S.I.3.3 is a high level
condition, but one that is fairly mild. Hall (1991) provides a primitive condition for Assumption
S.I.3.3 and Lemma 4.1 in that paper verifies that Assumption S.I.3.3 is implied.
Hall (1992a) and Hall (1992b) assume the same primitive condition. This condition is as
follows. On their compact support, assumed here to be $[-1,1]$, there exists a partition
$-1 = a_0 < a_1 < \cdots < a_m = 1$, such that on each $(a_{j-1}, a_j)$, $K$ and $M$ are differentiable,
with bounded, strictly monotone derivatives.

This condition is met for many kernels, with perhaps the only exception of practical
importance being the uniform kernel. As Hall (1991) describes, it is possible to prove the
Edgeworth expansion for the uniform kernel using different methods than we use below.
The uniform kernel is also ruled out for local polynomial regression, see Remark 9. ■
S.I.4 Bias


This section accomplishes three things. First, we carefully derive the bias of the initial
estimator and the bias correction. Second, we explicate the properties of the induced kernel
$M_\rho$ in terms of bias reduction and how exactly this kernel is “higher-order”. Finally, we
examine two other methods of bias reduction: (i) estimating the derivatives without using
derivatives of kernels (Singh, 1977), and (ii) the generalized jackknife approach (Schucany
and Sommers, 1977). Further methods are discussed and compared by Jones and Signorini
(1997). The message from both alternative methods echoes our main message: it is important
to account for any bias correction when doing inference, i.e., to avoid the mismatch present
in $T_{bc}$.


S.I.4.1 Precise Bias Calculations 

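Recall that the biases of the two estimators are as follows:

\[
E[\hat f] - f =
\begin{cases}
h^k f^{(k)}\mu_{K,k} + h^{k+2} f^{(k+2)}\mu_{K,k+2} + o(h^{k+2}) & \text{if } k \le S - 2 \\
h^k f^{(k)}\mu_{K,k} + O(h^{S+\varsigma}) & \text{if } k \in \{S-1, S\} \\
0 + O(h^{S+\varsigma}) & \text{if } k > S
\end{cases}
\tag{10}
\]

and

\[
E[\hat f - \hat B_f] - f =
\begin{cases}
h^{k+2} f^{(k+2)}\mu_{K,k+2} + h^k b^\ell f^{(k+\ell)}\mu_{K,k}\mu_{L,\ell} + o(h^{k+2} + h^k b^\ell) & \text{if } k + \ell \le S \\
h^{k+2} f^{(k+2)}\mu_{K,k+2} + O(h^k b^{S-k+\varsigma}) + o(h^{k+2}) & \text{if } 2 \le S - k < \ell \\
O(h^{S+\varsigma}) + O(h^k b^{S-k+\varsigma}) & \text{if } k \in \{S-1, S\} \\
O(h^{S+\varsigma}) + O(h^k b^{S-k}) & \text{if } k > S.
\end{cases}
\tag{11}
\]

The following Lemma gives a rigorous proof of these statements.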

Lemma 1. Under Assumptions S.I.3.1 and S.I.3.2, Equations (10) and (11) hold.

Proof. To show Eqn. (10), begin with the change of variables and the Taylor expansion

\[
E[\hat f] = h^{-1}\int K(X_{h,i}) f(X_i)\,dX_i = \int K(u) f(x - uh)\,du
 = \sum_{j=0}^{S}\Big\{ (-h)^j f^{(j)}(x)\int u^j K(u)\,du / j! \Big\} + (-h)^S \int u^S K(u)\big(f^{(S)}(\bar x) - f^{(S)}(x)\big)\,du,
\]

where $\bar x \in [x, x - uh]$. By the Hölder condition of Assumption S.I.3.1, the final term is
$O(h^{S+\varsigma})$. If $k > S$, then all $\int u^j K(u)\,du = 0$ for $1 \le j \le S$, and only this remainder is left. In all other
cases, $h^k f^{(k)}(x)\mu_{K,k}$ is the first nonzero term of the summation, and hence the leading bias
term. Further, by virtue of $k$ being even and $K$ symmetric, $\int u^{k+1} K(u)\,du = 0$, leaving only
$O(h^{S+\varsigma})$ when $k = S - 1$, and otherwise, when $k \le S - 2$, leaving $h^{k+2} f^{(k+2)}(x)\mu_{K,k+2} + o(h^{k+2})$.
This completes the proof of Eqn. (10).

To establish Eqn. (11), first write

\[ E[\hat f - \hat B_f] - f = E[\hat f - f - B_f] + E[B_f - \hat B_f], \]

where $B_f$ follows the convention of being identically zero if $k > S$. The first portion is
characterized by rearranging Eqn. (10), so it remains to examine the second term. Let
$\tilde k = k \wedge S$ (the smaller of $k$ and $S$). By repeated integration by parts, using the boundary
conditions of Assumption S.I.3.2:

\[
\begin{aligned}
E[\hat f^{(k)}] &= \frac{1}{b^{1+k}}\int L^{(k)}(X_{b,i}) f(X_i)\,dX_i \\
&= 0 + \frac{1}{b^{1+(k-1)}}\int L^{(k-1)}(X_{b,i}) f^{(1)}(X_i)\,dX_i \\
&= 0 + \frac{1}{b^{1+(k-2)}}\int L^{(k-2)}(X_{b,i}) f^{(2)}(X_i)\,dX_i \\
&\;\;\vdots \\
&= \frac{1}{b^{1+(k-\tilde k)}}\int L^{(k-\tilde k)}(X_{b,i}) f^{(\tilde k)}(X_i)\,dX_i \\
&= \frac{1}{b^{k-\tilde k}}\int L^{(k-\tilde k)}(u) f^{(\tilde k)}(x - ub)\,du,
\end{aligned}
\]

where each zero is the boundary term from one integration by parts, and the last line follows by
a change of variables. We now proceed separately for each case
delineated in (11). For $k > S$, no reduction is possible, and the final line
above is $O(b^{S-k})$, and with $B_f = 0$, we have $E[B_f - \hat B_f] = 0 - h^k\mu_{K,k}E[\hat f^{(k)}] = O(h^k b^{S-k})$, as
shown. For $k \le S$, by a Taylor expansion, the final line displayed above becomes

\[ \sum_{j=k}^{S}\Big\{ b^{j-k}\mu_{L,j-k} f^{(j)}(x) \Big\} + b^{S-k}\int u^{S-k} L(u)\big(f^{(S)}(\bar x) - f^{(S)}(x)\big)\,du. \]

The second term above is $O(b^{S-k+\varsigma})$ in all cases, and $\mu_{L,0} = 1$, which yields
$E[\hat f^{(k)}] = f^{(k)} + O(b^{S-k+\varsigma})$ for $k \in \{S-1, S\}$, using $\mu_{L,1} = 0$ in the former case. Next, if $k + \ell \le S$, the above
becomes $E[\hat f^{(k)}] = f^{(k)} + b^\ell f^{(k+\ell)}\mu_{L,\ell} + o(b^\ell)$, as $\mu_{L,j} = 0$ for $1 \le j < \ell$, whereas if $k + \ell > S$,
the remainder terms cannot be characterized, leaving $E[\hat f^{(k)}] = f^{(k)} + O(b^{S-k+\varsigma})$. Plugging
any of these results into $E[B_f - \hat B_f] = h^k\mu_{K,k}\big(f^{(k)} - E[\hat f^{(k)}]\big)$ completes the demonstration of
Eqn. (11). □


S.I.4.2 Properties of the kernel $M_\rho(\cdot)$

As made precise below, $M_\rho$ is a higher-order kernel. The choices of $K$, $L$, and $\rho$ determine
the shape of $M_\rho$, which in turn affects the variance and bias constants. In standard kernel
analyses, these constants are used to determine optimal kernel shapes for certain problems
(see Gasser et al. (1985) and references therein). For several choices of $K$, $L$, and $\rho$, Table
S.I.1 shows numerical results for the various constants of the induced kernel $M_\rho$. The table
includes (i) the variance, given by $\vartheta_{M,2}$ and relevant for interval length, (ii) a measure of
bias given by $\tilde\mu_{M,4}$, and finally (iii) the resulting mean square error constant, $[\vartheta_{M,2}^{8}\,\tilde\mu_{M,4}^{2}]^{1/9}$
(where $\tilde\mu_{M,4} = (k!)(-1)^k\mu_{M,4}$). These specific constants are due to $M_\rho$ being a fourth order kernel,
as discussed next, and would otherwise remain conceptually the same but rely on different
moments. A more general, but more cumbersome procedure would be to choose $\rho$ numerically
to minimize some notion of distance (e.g., $L_2$) between the resulting kernel $M_\rho$ and the
optimal kernel shape already available in the literature. However, using $\rho = 1$ as a simple
rule-of-thumb exhibits very little lost performance, as shown in the Table and discussed in the
paper.

It is worthwhile to make precise the sense in which the $n$-varying “kernel” $M_\rho(\cdot)$ of Eqn.
(9) is a higher-order kernel. Comparing Equations (10) and (11) shows exactly what is meant
by this statement: the bias rate attained agrees with a standard estimate using a kernel of
order $k + 2$ (if $\rho > 0$), as $\ell \ge 2$. For example, if $k = \ell = 2$ and $\rho > 0$, then $M_\rho(\cdot)$ behaves as
a fourth-order kernel in terms of bias reduction.

However, it is not true in general that $M(\cdot)$ is a higher-order kernel in the sense that its
moments below $k + 2$ are zero. That is, for any $j < k$, by the change of variables $w = \rho u$,

\[
\begin{aligned}
\int_{-1}^{1} u^j M(u)\,du &= \int_{-1}^{1} u^j K(u)\,du - \rho^{1+k}\mu_{K,k}\int_{-1}^{1} u^j L^{(k)}(\rho u)\,du \\
&= 0 - \rho^{1+k}\mu_{K,k}\,\rho^{-1-j}\int_{-\rho}^{\rho} w^j L^{(k)}(w)\,dw \\
&= 0 - \rho^{k-j}\mu_{K,k}\int_{-\rho}^{\rho} w^j L^{(k)}(w)\,dw.
\end{aligned}
\]

Now, $L(u) = L(-u)$ implies that $L^{(k)}(u) = (-1)^k L^{(k)}(-u)$. Since $k$ is even, $L^{(k)}(w)$ is symmetric,
therefore if $j$ is odd $0 = \int_{-\rho}^{\rho} w^j L^{(k)}(w)\,dw$ for any $\rho$. But this fails for $j$ even, even
for $\rho = 1$, and hence $\int_{-1}^{1} u^j M(u)\,du \ne 0$. For example, in the leading case of $k = \ell = 2$,
$\int u^2 M(u)\,du \ne 0$ in general, and so $M(\cdot)$ is not a fourth-order kernel in the traditional sense.

Table S.I.1: Numerical results for bias and variance constants of the induced higher-order kernel $M$ for several choices of $K$, $L$, and $\rho$.
The constants $\tilde\mu_{M,4}$ and $\vartheta_{M,2}$ measure bias and variance, respectively (the latter also being relevant for interval length). The
MSE is measured by $[\vartheta_{M,2}^{8}\,\tilde\mu_{M,4}^{2}]^{1/9}$, owing to $M_\rho$ being a fourth-order kernel.

Instead, the bias reduction is achieved differently. The proof of Lemma 1 makes explicit
use of the structure imposed by estimating $f^{(k)}$ using the derivative of the kernel $L(\cdot)$. From a
technical standpoint, an integration by parts argument shows how the properties of the kernel
$L(\cdot)$ (not the function $L^{(k)}(\cdot)$) are used to reduce bias. This argument precedes the Taylor
expansion of $f$, and thus moments of $M$ are never encountered and there is no requirement
that they be zero. This approach is simple, intuitive, and leads to natural restrictions on the
kernel $L$, and for this reason it is commonly employed in the literature and in practice (Hall,
1992b).

S.I.4.3 Other Bias Reduction Methods 

We now examine two other methods of bias reduction: (i) estimating the derivatives without
using derivatives of kernels (Singh, 1977), and (ii) the generalized jackknife approach
(Schucany and Sommers, 1977). Further methods are discussed and compared by Jones and
Signorini (1997). Both methods are shown to be tightly connected to our results. Further, a
more general message is that it is important to account for any bias correction when doing
inference, i.e., to avoid the mismatch present in $T_{bc}$.

The first method, which dates at least to Singh (1977), is to introduce a class of kernel 
functions directly for derivative estimation, more closely following the standard notion of 
a higher-order kernel rather than using the derivative of a kernel to estimate the density 
derivative and proving bias reduction via integration by parts. Jones (1994) expands on this 
method and gives further references. This class of kernels is used in the derivation of optimal 
kernel shapes (for derivative estimation) by Gasser et al. (1985). It is worthwhile to show how 
this class of kernel achieves bias correction and how this approach fits into our Edgeworth 
expansions. 

Consider estimating $f^{(k)}$ with

\[ \tilde f^{(k)}(x) = \frac{1}{n b^{1+k}}\sum_{i=1}^n J(X_{b,i}) \]

for some kernel function $J(\cdot)$. Note well that $J$ is generic, it need not itself be a derivative,
but this is the only difference here. A direct Taylor expansion (i.e. without first integrating
by parts) then gives

\[ E[\tilde f^{(k)}] = b^{-k}\sum_{j=0}^{S} b^{j}\mu_{J,j} f^{(j)} + O(b^{S+\varsigma}). \]

Thus, if $J$ satisfies $\mu_{J,j} = 0$ for $j = 0, 1, \ldots, k-1, k+1, k+2, \ldots, k+(\ell-1)$, $\mu_{J,k} = 1$, and
$\mu_{J,k+\ell} \ne 0$, and $S$ is large enough then

\[ E[\tilde f^{(k)}] = f^{(k)} + b^{\ell} f^{(k+\ell)}\mu_{J,k+\ell} + o(b^{\ell}), \]

just as achieved by $\hat f^{(k)}$ and exactly matching Eqn. (10). Note that $\mu_{J,0} = 0$, that is, the
kernel $J$ does not integrate to one. In the language of Gasser et al. (1985), $J$ is a kernel of
order $(k, k+\ell)$.

Given this result, bias correction can of course be performed using $\tilde f^{(k)}(x)$ (based on $J$)
rather than $\hat f^{(k)}(x)$ (based on $L^{(k)}$). Much will be the same: the structure of Eqn. (9) will hold
with $J$ in place of $L^{(k)}$ and the results in Eqn. (11) are achieved with modifications to the
constants (e.g., in the first line, $\mu_{J,k+\ell}$ appears in place of $\mu_{L,\ell}$). In either case, the same
bias rates are attained. Our Edgeworth expansions will hold for this class under the obvious
modifications to the notation and assumptions, and all the same conclusions are obtained.

When studying optimal kernel shapes, Gasser et al. (1985) actually further restrict the
class, by placing a limit on the number of sign changes over the support of the kernel, which
ensures that the MSE and variance minimization problems have well-defined solutions.
Collectively, these differences in the kernel classes explain why it is possible to demonstrate
“super-optimal” MSE and variance performance for certain choices of $K$, $L^{(k)}$, and $\rho$, as in
Table S.I.1.

A second alternative is the generalized jackknife method of Schucany and Sommers (1977),
and expanded upon by Jones and Foster (1993). To simplify the notation and ease exposition,
we describe this approach for second order kernels ($k = 2$), but the method, and all the
conclusions below, generalize fully. We thank an anonymous reviewer for encouraging us to
include these details.

Begin with two estimators $\hat f_1$ and $\hat f_2$, with (possibly different) bandwidths and second-order
kernels $h_j$ and $K_j$, $j = 1, 2$; thus Eqn. (10) gives

\[ E[\hat f_j] - f(x) = h_j^2 f^{(2)}\mu_{K_j,2} + o(h_j^2), \qquad j = 1, 2. \]

Schucany and Sommers (1977) propose to estimate $f$ with $\hat f_{GJ,R} := (\hat f_1 - R\hat f_2)/(1 - R)$, the
bias of which is

\[ E[\hat f_{GJ,R} - f] = \frac{f^{(2)}}{1 - R}\big(h_1^2\mu_{K_1,2} - R\,h_2^2\mu_{K_2,2}\big) + o(h_1^2 + h_2^2). \]

Hence, setting $R = (h_1^2\mu_{K_1,2})/(h_2^2\mu_{K_2,2})$ renders the leading bias exactly zero. Moreover, if
$S \ge 4$, $\hat f_{GJ,R}$ has bias $O(h_1^4 + h_2^4)$; behaving as a single estimator with $k = 4$. To put this in
context of our results, observe that with this choice of $R$, if we let $\rho = h_1/h_2$, then

\[ \hat f_{GJ,R} = \frac{1}{n h_1}\sum_{i=1}^n M(X_{h_1,i}), \qquad M(u) = K_1(u) - \rho^{1+2}\left\{ \frac{K_2(\rho u) - \rho^{-1}K_1(u)}{\mu_{K_2,2}(1 - R)} \right\}\mu_{K_1,2}, \]

exactly matching Eqn. (9). Or equivalently, $\hat f_{GJ,R} = \hat f_1 - h_1^2\,\tilde f^{(2)}\mu_{K_1,2}$, for the derivative
estimator

\[ \tilde f^{(2)}(x) = \frac{1}{n h_2^{1+2}}\sum_{i=1}^n J(X_{h_2,i}), \qquad J(u) = \frac{K_2(u) - \rho^{-1}K_1(u/\rho)}{\mu_{K_2,2}(1 - R)}. \]

Therefore, we can view $\hat f_{GJ,R}$ as a change in the kernel $M(\cdot)$ or an explicit bias estimation
described directly above with a specific choice of $J(\cdot)$ (depending on $\rho$ in either case). Again,
Eqn. (9) holds exactly. Thus, our results cover the generalized jackknife method as well, and
the same lessons apply.
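
The effect of this choice of $R$ on the combined kernel can be checked numerically. The Python sketch below forms the kernel of $(\hat f_1 - R\hat f_2)/(1 - R)$ on the scale of $h_1$, namely $[K_1(u) - R\rho K_2(\rho u)]/(1 - R)$ with $R = \rho^2\mu_{K_1,2}/\mu_{K_2,2}$, and verifies that it integrates to one while its second moment vanishes. It assumes both $K_1$ and $K_2$ are the Epanechnikov kernel and takes $\rho = 1/2$; the function names and the midpoint-rule integration are illustrative choices only.

```python
def epanechnikov(u):
    """Second-order kernel with mu_{K,2} = 1/10 and support [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def jackknife_kernel(u, rho, mu1=0.1, mu2=0.1, K1=epanechnikov, K2=epanechnikov):
    """Kernel of (f1 - R f2)/(1 - R) on the h_1 scale, with R = rho^2 mu1 / mu2."""
    R = rho**2 * mu1 / mu2  # equals h_1^2 mu_{K1,2} / (h_2^2 mu_{K2,2}); requires R != 1
    return (K1(u) - R * rho * K2(rho * u)) / (1.0 - R)

def moment(kernel, power, lo, hi, steps=200000):
    """Midpoint-rule approximation of int u^power kernel(u) du over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        u = lo + (i + 0.5) * width
        total += (u ** power) * kernel(u)
    return total * width

rho = 0.5  # h_1 / h_2, so K2(rho * u) is supported on [-2, 2]
m0 = moment(lambda u: jackknife_kernel(u, rho), 0, -2.0, 2.0)
m2 = moment(lambda u: jackknife_kernel(u, rho), 2, -2.0, 2.0)
```

Here the second moment is zero by construction, which is exactly the statement that the leading bias term cancels.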

Finally, we note that these bias correction methods can be applied to nonparametric re¬ 
gression as well, and local polynomial regression in particular, and that the same conclusions 
are found. We will not repeat this discussion however. 


S.I.5 First Order Properties 

Here we briefly state the first-order properties of $T_{us}$, $T_{bc}$, and $T_{rbc}$, using the common notation
$T_{v,w}$ defined in Section S.I.1. Recall that $\eta_v = \sqrt{nh}\,(E[\hat f_v] - f)$ is the scaled bias in either case.
With this notation, we have the following result.

Lemma 2. Let Assumptions S.I.3.1 and S.I.3.2 hold. Then if $nh \to \infty$, $\eta_v \to 0$, and if $v = 2$,
$\rho \to 0 + \bar\rho\,\mathbb{1}\{v = w\} < \infty$, it holds that $T_{v,w} \to_d N(0,1)$.

The conditions on $h$ and $b$ behind the generic assumption that the scaled bias vanishes
can be read off of (10) and (11): $T_{us}$ requires $\sqrt{nh}\,h^k \to 0$ whereas $T_{bc}$ and $T_{rbc}$ require
only $\sqrt{nh}\,h^k(h^2 \vee b^\ell) \to 0$, and thus accommodate $\sqrt{nh}\,h^k \not\to 0$ or $b \not\to 0$ (but not both).
However, bias correction requires a choice of $\rho = h/b$. One easily finds that $V[\sqrt{nh}\,\hat B_f] =
O(\rho^{1+2k})$, whence $\rho \to 0$ is required for $T_{bc}$. But $T_{rbc}$ does not suffer from this requirement
because of our proposed, new Studentization. From a first-order point of view, traditional bias
correction allows for a larger class of sequences $h$, but requires a delicate choice of $\rho$ (or $b$),
and Hall (1992b) shows that this constraint prevents $T_{bc}$ from improving inference. Our novel
standard errors remove these constraints, allowing for improvements in bias to carry over to
improvements in inference. The fact that a wider range of bandwidths is allowed hints at
the robustness to tuning parameter choice discussed above and formalized by our Edgeworth
expansions.

Remark 6 ($\rho \to \infty$). $T_{rbc} \to_d N(0,1)$ will hold even for $\bar\rho = \infty$, under the even weaker
bias rate restriction that $\eta_{bc} = o(\rho^{1/2+k})$, provided $nb \to \infty$. In this case $\hat B_f$ dominates the
first-order approximation, but $\hat\sigma^2_{rbc}$ still accounts for the total variability. However there is
no gain for inference: the bias properties cannot be improved due to the second bias term
($E[\hat f] - f - B_f$), while variance can only be inflated. Thus, we restrict to bounded $\rho$. Section
S.I.2.3 has more discussion on the choice of $\rho$. ■

S.I.6 Main Result: Edgeworth Expansion 

Recall the generic notation:

\[ T_{v,w} := \frac{\sqrt{nh}\,(\hat f_v - f)}{\hat\sigma_w} \]

for $1 \le w \le v \le 2$. The Edgeworth expansion for the distribution of $T_{v,w}$ will consist of
polynomials with coefficients that depend on moments of the kernel(s). Additional polynomials
are needed beyond those used in the main text for coverage error. These are:

\[ p^{(1)}_{v,w}(z) = \phi(z)\,\sigma_w^{-3}\left[\nu_{v,w}(1,1,2)\,z^2/2 - \nu_v(3)(z^2 - 1)/6\right], \]

\[ p^{(2)}_{v,w}(z) = -\phi(z)\,\sigma_w^{-3} E[\hat f_w]\,\nu_{v,w}(1,1,1)\,z^2, \qquad \text{and} \qquad p^{(3)}_{v,w}(z) = \phi(z)\,\sigma_w^{-1}. \]

The polynomials $p^{(k)}_{v,w}$ are even, and hence cancel out of coverage probability expansions, but
are used in the expansion of the distribution function itself (or equivalently, the coverage of a
one-sided confidence interval).

Next, recall from the main text the polynomials used in coverage error expansions:

\[ q_1(z;K) = \vartheta_{K,2}^{-2}\vartheta_{K,4}(z^3 - 3z)/6 - \vartheta_{K,2}^{-3}\vartheta_{K,3}^{2}\left[2z^3/3 + (z^5 - 10z^3 + 15z)/9\right], \]

\[ q_2(z;K) = -\vartheta_{K,2}^{-1}(z), \qquad \text{and} \qquad q_3(z;K) = \vartheta_{K,2}^{-2}\vartheta_{K,3}(2z^3/3). \]

The corresponding polynomials for expansions of the distribution function are

\[ q^{(k)}_{v,w}(z) = \frac{1}{2}\frac{\phi(z)}{f}\,q_k(z;N_w), \qquad k = 1,2,3. \]

As before, the $q^{(k)}_{v,w}$ are odd and hence do not cancel when computing coverage: the $q_k(z;N_w)$
in the main text are doubled for just this reason.
Note that, despite the notation, $q^{(k)}_{v,w}(z)$ depends only on the “denominator” kernel $N_w$.
The notation comes from the fact that when first computed, the terms which enter into the
$q^{(k)}_{v,w}(z)$ depend on both kernels, but the simplifications in Eqn. (16) reduce the dependence to
$N_w$. This is because for undersmoothing and robust bias correction, $v = w$, and for traditional
bias correction $N_2 = M = K + o(1) = N_1 + o(1)$, as $\rho \to 0$ is assumed. Thus, when computing
$\vartheta_{M,q}$ the terms with the lowest powers of $\rho$ will be retained. These can be found by expanding

\[ \vartheta_{M,q} = \int \big(K(u) - \rho^{1+k}\mu_{K,k}L^{(k)}(\rho u)\big)^q\,du = \sum_{j=0}^{q}\binom{q}{j}\big(-\mu_{K,k}\rho^{1+k}\big)^{q-j}\int K(u)^j L^{(k)}(\rho u)^{q-j}\,du, \]

and hence we can write $\vartheta_{M,q} = \vartheta_{K,q} - \rho^{1+k} q\,\mu_{K,k}L^{(k)}(0)\,\vartheta_{K,q-1} + O(h + \rho^{2+k})$. We can thus
write $q_j(z;M) = q_j(z;K) + o(1)$ in this case. If the expansions were carried out beyond terms
of order $(nh)^{-1} + (nh)^{-1/2}\eta_v + \eta_v^2 + \mathbb{1}\{v \ne w\}\rho^{1+2k}$ this would not be the case.

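Finally, for traditional bias correction, there are additional terms in the expansion (see discussion in the main text) representing the covariance of $\hat f$ and $\hat B_f$ (denoted by $\Omega_1$) and the variance of $\hat B_f$ ($\Omega_2$). We now state their precise forms. These arise from the mismatch between the variance of the numerator of $T_{bc}$ and the standardization used, $\sigma^2_{us}$; that is, $\sigma^2_{rbc}/\sigma^2_{us}$ is given by

$$\frac{nh\mathbb{V}[\hat f - \hat B_f]}{nh\mathbb{V}[\hat f]} = \frac{nh\mathbb{V}[\hat f] - 2nh\mathbb{C}[\hat f,\hat B_f] + nh\mathbb{V}[\hat B_f]}{nh\mathbb{V}[\hat f]} = 1 - 2\frac{nh\mathbb{C}[\hat f,\hat B_f]}{nh\mathbb{V}[\hat f]} + \frac{nh\mathbb{V}[\hat B_f]}{nh\mathbb{V}[\hat f]}.$$

This makes clear that $\Omega_1$ and $\Omega_2$ are the constant portions of the last two terms. We have

$$-2\frac{nh\mathbb{C}[\hat f,\hat B_f]}{nh\mathbb{V}[\hat f]} = \rho^{1+k}\Omega_1,$$

where

$$\Omega_1 = -2\frac{\mu_{K,k}}{\nu_1(2)}\left\{\int f(x - uh)K(u)L^{(k)}(u\rho)du - b\int f(x - uh)K(u)du\int f(x - ub)L^{(k)}(u)du\right\}.$$

Note $\nu_1(2) = \sigma^2_{us}$. Turning to $\Omega_2$, using the calculations in Section S.I.4.1 (recall $\tilde k = k \wedge S$), we find that

$$\frac{nh\mathbb{V}[\hat B_f]}{nh\mathbb{V}[\hat f]} = \rho^{1+2k}\Omega_2, \quad \text{where} \quad \Omega_2 = \frac{\mu_{K,k}^2}{\nu_1(2)}\left\{\int f(x - ub)L^{(k)}(u)^2du - b^{1+2\tilde k}\left(\int L^{(k-\tilde k)}(u)f^{(\tilde k)}(x - ub)du\right)^2\right\}.$$

Fully simplifying would yield

$$\Omega_2 = \mu_{K,k}^2\vartheta_{K,2}^{-1}\vartheta_{L^{(k)},2},$$

which can be used in Theorem 3.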

As a last piece of notation, define the scaled bias as $\eta_v = \sqrt{nh}\,(\mathbb{E}[\hat f_v] - f)$.

We can now state our generic Edgeworth expansion, from which the coverage probability expansion results follow immediately.

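Theorem 3. Suppose Assumptions S.I.3.1, S.I.3.2, and S.I.3.3 hold, $nh/\log(n) \to \infty$, $\eta_v \to 0$, and if $v = 2$, $\rho \to 0 + \bar\rho\,\mathbb{1}\{v = w\}$. Then for

$$F_{v,w}(z) = \Phi(z) + \frac{1}{\sqrt{nh}}p^{(1)}_{v,w}(z) + \sqrt{\frac{h}{n}}\,p^{(2)}_{v,w}(z) + \eta_v p^{(3)}_{v,w}(z) + \frac{1}{nh}q^{(1)}_{v,w}(z) + \eta_v^2 q^{(2)}_{v,w}(z) + \frac{\eta_v}{\sqrt{nh}}q^{(3)}_{v,w}(z) - \mathbb{1}\{v \ne w\}\rho^{1+k}\left(\Omega_1 + \rho^{k}\Omega_2\right)\frac{\phi(z)}{2}z,$$

we have

$$\sup_{z \in \mathbb{R}}\left|\mathbb{P}[T_{v,w} < z] - F_{v,w}(z)\right| = o\left((nh)^{-1} + (nh)^{-1/2}\eta_v + \eta_v^2 + \mathbb{1}\{v \ne w\}\rho^{1+2k}\right).$$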

To use this result to find the expansion of the error in coverage probability of the Normal-based confidence interval, the function $F_{v,w}(z)$ is simply evaluated at the two endpoints of the interval. (Note: if the confidence interval were instead constructed with the bootstrap, a few additional steps would be needed, but these do not alter any conclusions or results outside of constant terms.)
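The effect of evaluating the expansion at the two endpoints can be seen in a toy computation. The sketch below is not the expansion of Theorem 3 itself: it uses a made-up distribution function with one even correction term and one odd correction term (the coefficients `a_even` and `b_odd` are hypothetical) to show that, for a symmetric interval $[-z, z]$, the even term cancels and the odd term is doubled.

```python
import math


def Phi(z):
    """Standard Normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def phi(z):
    """Standard Normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def F(z, a_even, b_odd):
    """Toy expansion: Phi(z) + a * phi(z) * z^2 + b * phi(z) * (z^3 - 3z).

    The first correction is an even function of z, the second is odd.
    """
    return Phi(z) + a_even * phi(z) * z**2 + b_odd * phi(z) * (z**3 - 3 * z)


def two_sided_coverage(z, a_even, b_odd):
    """Coverage of the symmetric interval [-z, z] under the toy expansion."""
    return F(z, a_even, b_odd) - F(-z, a_even, b_odd)
```

Changing `a_even` leaves `two_sided_coverage` unchanged, while the odd term contributes `2 * b_odd * phi(z) * (z**3 - 3*z)` on top of the nominal `2 * Phi(z) - 1`.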


S.I.6.1 Undersmoothing vs. Bias-Correction Exhausting all Smoothness

In general, we have assumed that the level of smoothness was large enough to be inconsequential in the analysis, and in particular this allowed for characterization of optimal bandwidth choices. In this section, in contrast, we take the level of smoothness to be binding, so that we can fully utilize the $S$ derivatives and the Hölder condition to obtain the best possible rates of decay in coverage error for both undersmoothing and robust bias correction, but at the price of implementability: the leading bias constants cannot be characterized, and hence feasible “optimal” bandwidths are not available.

For undersmoothing, the lowest bias is attained by setting $k > S$ (see Eqn. (10)), in which case the bias is only known to satisfy $\mathbb{E}[\hat f] - f = O(h^{S+\varsigma})$ (i.e., $B_f$ is identically zero) and bandwidth selection is not feasible. Note that this approach allows for $\sqrt{nh}\,h^{S} \not\to 0$, as $\eta_{us} = O(\sqrt{nh}\,h^{S+\varsigma})$.

Robust bias correction has several interesting features here. If $k \le S - 2$ (the top two cases in Eqn. (11)), then the bias from approximating $\mathbb{E}[\hat f] - f$ by $B_f$, which is not targeted by bias correction, dominates $\eta_{bc}$ and prevents robust bias correction from performing as well as the best possible infeasible (i.e., oracle) undersmoothing approach. That is, even bias correction requires a sufficiently large choice of $k$ in order to ensure the fastest possible rate of decay in coverage error: if $k \ge S - 1$, robust bias correction can attain the same error decay rate as the best undersmoothing approach, and allow $\sqrt{nh}\,h^{S} \not\to 0$.

Within $k \ge S - 1$, two cases emerge. On the one hand, if $k = S - 1$ or $S$, then $B_f$ is nonzero and must be consistently estimated to attain the best rate. Indeed, more is required. From Eqn. (11), we will need a bounded, positive $\rho$ to equalize the bias terms. This (again) highlights the advantage of robust bias correction, as the classical procedure would enforce $\rho \to 0$, and thus underperform. On the other hand, $\rho \to 0$ will be required if $k > S$ because (from the final case of (11)) we require $\rho^{k-S} = O(h^{\varsigma})$ to attain the same rate as undersmoothing. Note that we can accommodate $b \not\to 0$ (but bounded). Interestingly, $B_f$ is identically zero and $\hat B_f$ merely adds noise to the problem, but this noise is fully accounted for by the robust standard errors, and hence does not affect the rates of coverage error (though the constants of course change). The $\hat f^{(k)}$ in $\hat B_f$ is inconsistent ($f^{(k)}$ does not exist), but the nonvanishing bias of $\hat f^{(k)}$ is dominated by $h^k$.

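This discussion is summarized by the following result:

Corollary 7. Let the conditions of Theorem 3 hold.

(a) If $k > S$, then

$$\mathbb{P}[f \in I_{us}] = 1 - \alpha + \frac{1}{nh}\frac{\phi(z_{\alpha/2})}{f}q_1(K)\{1 + o(1)\} + O\left(nh^{1+2S+2\varsigma} + h^{S+\varsigma}\right).$$

(b) If $k \ge S - 1$, then

$$\mathbb{P}[f \in I_{rbc}] = 1 - \alpha + \frac{1}{nh}\frac{\phi(z_{\alpha/2})}{f}q_1(M)\{1 + o(1)\} + O\left(nh\left(h^{S+\varsigma} \vee h^{k}b^{S-k+\varsigma\mathbb{1}\{k \le S\}}\right)^2 + \left(h^{S+\varsigma} \vee h^{k}b^{S-k+\varsigma\mathbb{1}\{k \le S\}}\right)\right).$$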

S.I.6.2 Multivariate Densities and Derivative Estimation 

We now briefly state analogues of our results, both for distributional convergence and Edgeworth expansions, that cover multivariate data and derivative estimation. The conceptual discussion and implications are similar to those in the main text, once adjusted notationally to the present setting, and are hence omitted.

For a nonnegative integral $d$-vector $q$ we adopt the notation that: (i) $[q] = q_1 + \cdots + q_d$, (ii) $g^{(q)}(x) = \partial^{[q]}g(x)/(\partial^{q_1}x_1 \cdots \partial^{q_d}x_d)$, (iii) $q! = q_1! \cdots q_d!$, and (iv) $\sum_{[q]=Q}$, for some integer $Q \ge 0$, denotes the sum over all indexes in the set $\{q : [q] = Q\}$.

The parameter of interest is $f^{(q)}(x)$, for $x \in \mathbb{R}^d$ and $[q] \le S$. The estimator is

$$\hat f^{(q)}(x) = \frac{1}{nh^{d+[q]}}\sum_{i=1}^{n}K^{(q)}(X_{h,i}).$$

Note that here, and below for bias correction, we use a constant, diagonal bandwidth matrix, e.g. $h \times I_d$. This is for simplicity and comparability, and could be relaxed at notational expense.
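To make the estimator concrete, the following minimal sketch computes the level case ($q = 0$) with bandwidth matrix $h \times I_d$. It assumes a product Epanechnikov kernel, which is an illustrative choice and not something specified above; the function names are hypothetical.

```python
def epanechnikov(u):
    """Univariate Epanechnikov kernel, 0.75 (1 - u^2) on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def product_kernel(u_vec):
    """d-dimensional kernel built as a product of univariate kernels (assumption)."""
    value = 1.0
    for u in u_vec:
        value *= epanechnikov(u)
    return value


def kde(x, data, h):
    """Density estimate at the point x with a common bandwidth h in every coordinate.

    x    : sequence of length d
    data : list of n observations, each a sequence of length d
    """
    n = len(data)
    d = len(x)
    total = 0.0
    for obs in data:
        # Scaled argument (x - X_i) / h, coordinate by coordinate.
        total += product_kernel([(xj - oj) / h for xj, oj in zip(x, obs)])
    return total / (n * h**d)
```

The division by `h**d` is what a bandwidth matrix of the form $h \times I_d$ implies; a general diagonal matrix would replace it with the product of the coordinate-specific bandwidths.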

The bias, for a given kernel of order fi < S — [g] (we restrict attention to the case where 
S is large enough), is 

h‘ Y. ^K,tf {, * k \x) + o(h‘), 

k:[k-\-q] =k 

exactly mirroring Eqn. (10), where now ftK,k represents a d-dimensional integral. Bias esti¬ 
mation is straightforward, relying on estimates f( q+k \x), for all [k] — H — [g]. The form of 
f ^ (x) = /^ (x) — Bf( q ) (x) is now given by 

1 n 

f 2 q \x) = nhd+[q] M (q) ( x h,i) where M (q) (u) = K {q \u) - { p ) d+ ^ +A ^ g K , k L (q+k) («), 

i =1 [k]=H 

exactly analogous to Eqn. (9). 

With these changes in notation out of the way, we can (re-)define the generic framework for both estimators exactly as above. Dropping the point of evaluation $x$, for $v \in \{1, 2\}$, define the estimator as

$$\hat f_v^{(q)} = \frac{1}{nh^{d+[q]}}\sum_{i=1}^{n}N_v(X_{h,i}), \quad \text{where } N_1(u) = K^{(q)}(u) \text{ and } N_2(u) = M^{(q)}(u);$$

the variance

$$\sigma_v^2 := nh^{d+2[q]}\,\mathbb{V}[\hat f_v^{(q)}] = \frac{1}{h^d}\left\{\mathbb{E}\left[N_v(X_{h,i})^2\right] - \mathbb{E}\left[N_v(X_{h,i})\right]^2\right\}$$

and its estimator as

$$\hat\sigma_v^2 = \frac{1}{h^d}\left\{\frac{1}{n}\sum_{i=1}^{n}N_v(X_{h,i})^2 - \left[\frac{1}{n}\sum_{i=1}^{n}N_v(X_{h,i})\right]^2\right\};$$

and the $t$-statistics, for $1 \le w \le v \le 2$, as

$$T_{v,w} := \frac{\sqrt{nh^{d+2[q]}}\,(\hat f_v^{(q)} - f^{(q)})}{\hat\sigma_w}.$$

As before, $T_{us} = T_{1,1}$, $T_{bc} = T_{2,1}$, and $T_{rbc} = T_{2,2}$.

The scaled bias $\eta_v$ has the same general definition as well: the bias of the numerator of the $T_{v,w}$. In this case, it is given by

$$\eta_v = \sqrt{nh^{d+2[q]}}\left(\mathbb{E}[\hat f_v^{(q)}] - f^{(q)}\right).$$

The asymptotic order of $\eta_v$ for different settings can be obtained straightforwardly via the obvious multivariate extensions of Equation (11) and the corresponding conclusion of Lemma 1.

First-order convergence is now given by the following result, the proof of which is standard. 


Lemma 3. Suppose appropriate multivariate versions of Assumptions S.I.3.1 and S.I.3.2 hold, $nh^{d+2[q]} \to \infty$, $\eta_v \to 0$, and if $v = 2$, $\rho \to 0 + \bar\rho\,\mathbb{1}\{v = w\}$. Then $T_{v,w} \to_d \mathsf{N}(0,1)$.


For the Edgeworth expansion, redefine 


Vv,w(,j,k,p ) 


1 

/ld+[g]l{j+pfc=l} 


E 


(N v ( Ui ) ~E[N v ( Ui )}) j (N w ( Ui ) p 


®[N v {uiY]) k 


where $U_i = (x - X_i)/h$. The polynomials $p^{(k)}_{v,w}(z)$ and $q^{(k)}_{v,w}(z)$ are as given above, but using multivariate moments. The analogue of Theorem 3 is given by the following result, which can be proven following the same steps as in Section S.I.7.


Theorem 4. Suppose appropriate multivariate versions of Assumptions S.I.3.1, S.I.3.2, and S.I.3.3 hold, $nh^{d+2[q]}/\log(n) \to \infty$, $\eta_v \to 0$, and if $v = 2$, $\rho \to 0 + \bar\rho\,\mathbb{1}\{v = w\}$. Then for


Fv,w(z) = &(z) + —=p^l(z) + 
Ynh a 


S(«) + \j—^—Pvl( Z ) + VvP ( vl(z) + + VvQi%( z ) + -^=dvi,{ z ) 


A z ) 
we have 


sup |P [T VtW <z}- F V)W (z)\ = o {[{nh d )- l > 2 + Vv ) 2 + l{v^w}p d+2 ^) . 

zSK 

The same conclusions reached in the main text continue to hold for multivariate and/or derivative estimation, both in terms of comparing undersmoothing, bias correction, and robust bias correction, as well as for inference-optimal bandwidth choices. In particular, it is straightforward that the MSE-optimal bandwidth in general has the rate $n^{-1/(d+2k+2[q])}$, whereas the coverage error optimal choice is of order $n^{-1/(d+k+[q])}$. Note that these two fit the same pattern as in the univariate, level case, with $k + [q]$ in place of $k$ and $d$ in place of one. One intuitive reason for the similarity is that the number of derivatives in question does not impact the variance or higher order moment terms of the expansion, once the scaling is accounted for. That is, for all averages beyond the first, for example of the kernel squared, $\sqrt{nh^d}$ can be thought of as the effective sample size, since that is the multiplier which stabilizes averages.
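The two rates can be compared with a few lines of arithmetic. The helper names below are hypothetical and the functions return only the rate exponents and the resulting orders of magnitude; the leading constants, which depend on the unknown density, are deliberately left out.

```python
def mse_optimal_exponent(d, k, q_sum=0):
    """Exponent a such that the MSE-optimal bandwidth is of order n**(-a).

    d is the dimension, k the kernel order, q_sum = [q] the total derivative order.
    """
    return 1.0 / (d + 2 * k + 2 * q_sum)


def coverage_optimal_exponent(d, k, q_sum=0):
    """Exponent a such that the coverage error optimal bandwidth is of order n**(-a)."""
    return 1.0 / (d + k + q_sum)


def bandwidth_order(n, exponent):
    """Order of magnitude n**(-exponent), ignoring the leading constant."""
    return n ** (-exponent)
```

In the univariate level case with a second-order kernel (`d=1`, `k=2`, `q_sum=0`) these give exponents 1/5 and 1/3, so the coverage error optimal bandwidth shrinks faster than the MSE-optimal one.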

S.I.7 Proof of Main Result 

Throughout $C$ shall be a generic constant that may take different values in different uses. If more than one constant is needed, $C_1, C_2, \ldots$, will be used. It will cause no confusion (as the notations never occur in the same place), but in the course of proofs we will frequently write $s = \sqrt{nh}$, which overlaps with the order of the kernel $L$.

The first step is to write $T_{v,w}$ as a smooth function of sums of i.i.d. random variables plus a remainder term that is shown to be of higher order. In addition to the notation above, define

$$\gamma_{v,p} = h^{-1}\mathbb{E}\left[N_v(X_{h,i})^p\right] \quad \text{and} \quad \Delta_{v,j} = \frac{1}{s}\sum_{i=1}^{n}\left\{N_v(X_{h,i})^j - \mathbb{E}\left[N_v(X_{h,i})^j\right]\right\}.$$

With this notation $\hat f_v - \mathbb{E}[\hat f_v] = s^{-1}\Delta_{v,1}$, $\sigma_w^2 = \mathbb{E}[\Delta_{w,1}^2] = \gamma_{w,2} - h\gamma_{w,1}^2$, and

$$\hat\sigma_w^2 - \sigma_w^2 = s^{-1}\Delta_{w,2} - h2\gamma_{w,1}s^{-1}\Delta_{w,1} - hs^{-2}\Delta_{w,1}^2. \tag{12}$$
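The decomposition of the variance estimation error in Eqn. (12) is an exact algebraic identity, which can be checked numerically. The sketch below treats the kernel evaluations as an arbitrary list of numbers `N` and the population moments `m1`, `m2` (standing in for the first and second moments of the kernel evaluations) as given inputs; these names are hypothetical and chosen only for this check.

```python
import math


def variance_gap(N, h, m1, m2):
    """Return both sides of the identity
        sigma_hat^2 - sigma^2
            = Delta_2 / s - 2 h gamma_1 Delta_1 / s - h Delta_1^2 / s^2,
    where s = sqrt(n h), gamma_p = m_p / h, and
    Delta_j = (1/s) * sum_i (N_i^j - m_j).
    """
    n = len(N)
    s = math.sqrt(n * h)
    gamma1, gamma2 = m1 / h, m2 / h
    delta1 = sum(v - m1 for v in N) / s
    delta2 = sum(v * v - m2 for v in N) / s

    # Population variance and its sample analogue.
    sigma2 = gamma2 - h * gamma1**2
    mean1 = sum(N) / n
    mean2 = sum(v * v for v in N) / n
    sigma2_hat = (mean2 - mean1**2) / h

    lhs = sigma2_hat - sigma2
    rhs = delta2 / s - h * 2 * gamma1 * delta1 / s - h * delta1**2 / s**2
    return lhs, rhs
```

The two returned values agree up to floating-point rounding for any inputs, which is why the proof can work with the right-hand side term by term.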

By a change of variables

$$\gamma_{v,p} = h^{-1}\int N_v(X_{h,i})^p f(X_i)\,dX_i = \int N_v(u)^p f(x - uh)\,du = O(1).$$

Further, by construction $\mathbb{E}[\Delta_{w,j}] = 0$ and

$$\mathbb{V}[\Delta_{w,j}] = h^{-1}\mathbb{E}\left[N_w(X_{h,i})^{2j}\right] - h^{-1}\mathbb{E}\left[N_w(X_{h,i})^{j}\right]^2 \le h^{-1}\mathbb{E}\left[N_w(X_{h,i})^{2j}\right] = \gamma_{w,2j} = O(1).$$


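Returning to Eqn. (12) and applying Markov's inequality, we find that $hs^{-2}\Delta_{w,1}^2 = n^{-1}\Delta_{w,1}^2 = O_p(n^{-1})$ and $\hat\sigma_w^2 - \sigma_w^2 = s^{-1}O_p(1) - hO(1)s^{-1}O_p(1) - hs^{-2}O_p(1) = O_p(s^{-1})$, whence $|\hat\sigma_w^2 - \sigma_w^2|^2 = O_p(s^{-2})$. Using these results preceded by a Taylor expansion, we have

$$\left(\frac{\hat\sigma_w^2}{\sigma_w^2}\right)^{-1/2} = 1 - \frac{1}{2}\frac{\hat\sigma_w^2 - \sigma_w^2}{\sigma_w^2} + \frac{3}{8}\frac{(\hat\sigma_w^2 - \sigma_w^2)^2}{\sigma_w^4} + o_p\left((\hat\sigma_w^2 - \sigma_w^2)^2\right)$$

$$= 1 - \frac{1}{2\sigma_w^2}\left(s^{-1}\Delta_{w,2} - h2\gamma_{w,1}s^{-1}\Delta_{w,1}\right) + O_p(n^{-1} + s^{-2}).$$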
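The Taylor step uses the second-order expansion of $(1 + x)^{-1/2}$ around $x = 0$, with $x$ playing the role of the relative variance estimation error. A short numerical check of that expansion, under the hypothetical function name below, is:

```python
def inv_sqrt_taylor(x):
    """Second-order Taylor approximation of (1 + x) ** -0.5 around x = 0."""
    return 1.0 - x / 2.0 + 3.0 * x * x / 8.0


def taylor_error(x):
    """Absolute error of the second-order approximation at x."""
    return abs((1.0 + x) ** -0.5 - inv_sqrt_taylor(x))
```

The error is of third order in `x`: halving `x` reduces it by roughly a factor of eight, which is consistent with the remainder being negligible relative to the squared term.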


Combining this result with the fact that 


T = 

- L V.W 


A Wj i + r] v A vA rj. 


(j 1 . 


+ — I ^ 
CP/! \ & W 


2 \ -1/2 


we have 


P[r„ jW < z] = p 


rp _ TD . _ Vv 

v,w ^b vw ^ ^ 

(Jin 


(13) 


where 


r Wj10 = (s 1 A w>2 - h2^ W)1 s 

(Tin Z(J nn 


and is a smooth function of sums of i.i.d. random variables and the remainder term is 
Vv A _ 2 A*,i , 3(d 2 -a 2 ) 2 


R — — hg~ z + -- 
cr,„ V 2 cry 8 


cu 


+ o P ((d 2 -a 2 ) 2 ) . 


Next we apply the delta method, see Hall (1992a, Chapter 2.7) or Andrews (2002, Lemma 
5(a)). It will be true that 


P [T v , w < z] = P 




Vv 

(Jo, 


+ o(s~ 2 ) 


(14) 
if it can be shown that $s^2\,\mathbb{P}[|R_{v,w}| > \varepsilon^2 s^{-2}\log(s)^{-1}] = o(1)$.¹ This can be demonstrated by applying Bernstein's inequality to each piece of $R_{v,w}$, as the kernels $K$ and $L$, and their derivatives, are bounded.

To apply this inequality to the first term of $R_{v,w}$, note that $|N_w((x - X_i)/h)| \le C_1$ and that $\mathbb{V}[N_w((x - X_i)/h)] \le C_2 h$, for different constants, and so for $\varepsilon > 0$ we have


s 2 P 


A 2 

Vv r ^ 2 -2 1 / \—1 

—hs —X > e s log(s) 
(Jin 2 ( 7 .,. 


= s 2 P 


= s 2 P 


i =1 
n 

[N w (X hti }}} 


> £S _ 1 log(s ) _1/2 / P”" S2N ^ 


Vv 


1=1 


> e 


2cr In \ 1/2 


Vv log(s) 


< 2s 2 exp { -- 

< s 2 exp { -C 


£ 2 °lnv v lo s0 


i-i 


2 C 2 nh + \eC 1 ^2al J n/[r} v log(s)] 
£ 2 log(s) -1 


Vh + £y/vv/[n\og(s)] 


< exp 2 Ci log(s) 


1 -Ci 


r]hlog(s) 2 + £y/r] v log(s) 3 /n\ 


which tends to zero because $\eta_v \to 0$ as $n \to \infty$ is assumed. To see why, note first that the second term of the denominator automatically vanishes, as $\eta_v \to 0$ and $\log(s)^3/n \to 0$. Second, suppose $\eta_v^2 \asymp nh^{\omega}$ (for example, if $\eta_{us} \asymp sh^{k}$, then $\omega = 1 + 2k$). If the first term diverged, $h$ would have to be at least as large (in order) as

$$\left(\frac{1}{n\log(s)^4}\right)^{1/(2+\omega)},$$

which would make the requirement that $\eta_v \to 0$ equivalent to

$$\eta_v^2 \asymp nh^{\omega} = n^{1-\omega/(2+\omega)}\log(s)^{-4\omega/(2+\omega)} \to 0,$$

which is impossible. The remaining terms of $R_{v,w}$, characterized using Eqn. (12), are handled in exactly the same way. This establishes Eqn. (14).

¹ Here, $s^{-2}\log(s)^{-1}$ may be replaced with any sequence that is $o(s^{-2} + \eta_v^2 + s^{-1}\eta_v)$.

Next, the proofs of Hall (1992a, Chapters 4.4 and 5.5) show that $T_{v,w}$ has an Edgeworth expansion valid through $o(s^{-2} + s^{-1}\eta_v + \eta_v^2)$. Thus, for a smooth function $G(z)$ we can write $\mathbb{P}[T_{v,w} < z] = G(z) + o(s^{-2} + s^{-1}\eta_v + \eta_v^2)$. Therefore

— G m (z) + o(s~ 2 + + 17;5). 


p 

T Vv " 

-L v,w ^ ^ 

= P 

Ty,w ^ ^ 






(15) 


The final result now follows by combining Equations (13), (14), and (15) with the terms 
of the expansion computed below. □ 


S.I.7.1 Computing the Terms of the Expansion

Identifying the terms of the expansion is a matter of straightforward, if tedious, calculation. The first four cumulants of $T_{v,w}$ must be calculated, which are functions of the first four moments. In what follows, we give a short summary. Note well that we always discard higher-order terms for brevity, and to save notation we will write $\overset{o}{=}$ to stand in for “equal up to $o((nh)^{-1} + (nh)^{-1/2}\eta_v + \eta_v^2 + \mathbb{1}\{v \ne w\}\rho^{1+2k})$”.

Referring to the Taylor expansion above, for the purpose of computing moments and cumulants, we can use

$$T_{v,w} \approx \left(\frac{\Delta_{v,1}}{\sigma_w} + \frac{\eta_v}{\sigma_w}\right)\left(1 - \frac{s^{-1}\Delta_{w,2}}{2\sigma_w^2} + \frac{h\gamma_{w,1}s^{-1}\Delta_{w,1}}{\sigma_w^2} + \frac{3s^{-2}\Delta_{w,2}^2}{8\sigma_w^4}\right).$$

Moments of the two sides agree up to the requisite order. Straightforward moment calculations then give

B[T , t _o s^EfA^A^] | /is-^^.iEfA^^A^i] , 3s- 2 E[A t)i iA^ 2 ] | Vv | 3s" 2 7/„E[A 2 i2 ] 

^[-*-V,w\ n o ~"T O ' n c "I -- ! - 


2m 3 . 


W 


g: 


w 


8 at 
2cr 3 , cr 3 a v ’ 


W 


' w 


2 _ 2 E[A» fl Ay , _ X E[A 2 1 A W)2 ] , _ l7w ,iE[A 2 ^ 


EKJ A 
+ s" 


+ 2 hs~ 


at 


at 


at 


at 


_ 1 2E[A^iA U) 2 ] _i 4 7w X E[A V 1 A^ 1 ] r? 2 

- Tj v s - - -h rjvhs -o-h — 


at 


at 


at 


o_ CR -2 _ 2 2t/„, 1 u(l, 1,2) 2 _ 2 ^ |W (2,1,2) 2 _ 1 2^ [W (1,1,2) | 

O “T 5 « S p ' O 5 

(T z (J D CT° G z G z <T Z 

oil on on on on on 


3 1 o P[A^i] -iEfA^A^] _ i 7 w , i iE[A 3 j 1 A IU) i] 3E[A 2 J _ 1 9E[A 2 )1 A j/)) 2] 
cr J 


2cr 5 

in 
G° 
+ Vv -- VvS~ 
o —1*4(3) _i 9iy, )ti ,(l, 1, 2)o" v _i 9^fw,iVv,w (I; I; 1) 3cr 
+ Vv W’ 


<J?, 


and, 


„„ 4 , o e[a;j .^^EIAJ^A^J _ 2 3E[A^AJ, ] 

E|T„ J =-r J -S - * -+ 4/IS 


(J 4 (J 6 (J 6 

w w w 

. «[A? ] ..SElA’jA^] 2 6 E|A; ] 

v» —— v - ± -+ n. 


+ S" 


cr° 


(T1 


u 


lv a 4 

'-s in 


o_ 2 ^( 4 ) , Q <4 fi _ 2 8*4(3)*4,«j(l, 1, 2) + 12cr|z/„ )lt; (2,1, 2) ; ^_ 2 9cr^ )W (0,2, 2) 

— 5 „ “r o . S c “r 5 0 


at at 


(J 

it 


(J 

'- y it 


+ s 


- 2 36or>^(l,l,2) 2 _ 1 4z/„(3) _ 1 24 C tX-( 1,1,2) , 2 6a, 


er c 

<>1 


+ VvS —t -T7„s- 


(T° 


+^x- 


cul 


The expansion now follows, formally, from the following steps. First, combining the above moments into cumulants. Second, these cumulants may be simplified using that

$$\frac{\sigma_v^2}{\sigma_w^2} = 1 + \mathbb{1}\{w \ne v\}\left(\rho^{1+k}\Omega_1 + \rho^{1+2k}\Omega_2\right)$$

and in all cases present

$$\nu_{v,w}(i,j,p) = f\,\vartheta_{N_w,\,i+jp} + o(1). \tag{16}$$

The second relation is readily proven for $v = w$, as $\nu_{v,v}(i,j,p) = \mathbb{E}[h^{-1}N_v(X_{h,i})^{i+jp}] + O(h)$, where the remainder represents products of expectations. In the case for $v \ne w$, we find $\nu_{2,1}(i,j,p) = f\,\vartheta_{N_1,\,i+jp} + O(\rho^{1+k} + h)$, and in this case $\rho \to 0$ is assumed. For any term of a cumulant with a rate of $(nh)^{-1}$, $(nh)^{-1/2}\eta_v$, $\eta_v^2$, or $\rho^{1+2k}$ (i.e., the extent of the expansion), these simplifications may be inserted as the remainder will be negligible. Note that this is exactly why the polynomials $p^{(k)}_{v,w}$ do not simplify, while the $q^{(k)}_{v,w}$ do. Third, with the cumulants in hand, the terms of the expansion are determined as described by e.g., Hall (1992a, Chapter 2).

Finally, for traditional bias correction, there are additional terms in the expansion (see discussion in the main text) representing the covariance of $\hat f$ and $\hat B_f$ (denoted by $\Omega_1$) and the variance of $\hat B_f$ ($\Omega_2$). We now state their precise forms. These arise from the mismatch between the variance of the numerator of $T_{bc}$ and the standardization used, $\sigma^2_{us}$; that is, $\sigma^2_{rbc}/\sigma^2_{us}$ is given by

$$\frac{nh\mathbb{V}[\hat f - \hat B_f]}{nh\mathbb{V}[\hat f]} = \frac{nh\mathbb{V}[\hat f] - 2nh\mathbb{C}[\hat f,\hat B_f] + nh\mathbb{V}[\hat B_f]}{nh\mathbb{V}[\hat f]} = 1 - 2\frac{nh\mathbb{C}[\hat f,\hat B_f]}{nh\mathbb{V}[\hat f]} + \frac{nh\mathbb{V}[\hat B_f]}{nh\mathbb{V}[\hat f]}.$$
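This makes clear that $\Omega_1$ and $\Omega_2$ are the constant portions of the last two terms. First, for $\Omega_1$,

$$\mathbb{C}[\hat f,\hat B_f] = \mathbb{C}\left[\frac{1}{nh}\sum_{i=1}^{n}K(X_{h,i}),\; h^{k}\mu_{K,k}\frac{1}{nb^{1+k}}\sum_{i=1}^{n}L^{(k)}(X_{b,i})\right]$$

$$= h^{k}\mu_{K,k}\frac{1}{nb^{1+k}}\left\{\mathbb{E}\left[h^{-1}K(X_{h,i})L^{(k)}(X_{b,i})\right] - b\,\mathbb{E}\left[h^{-1}K(X_{h,i})\right]\mathbb{E}\left[b^{-1}L^{(k)}(X_{b,i})\right]\right\}$$

$$= h^{k}\mu_{K,k}\frac{1}{nb^{1+k}}\left\{\int f(x - uh)K(u)L^{(k)}(u\rho)du - b\int f(x - uh)K(u)du\int f(x - ub)L^{(k)}(u)du\right\}.$$

Therefore

$$-2\frac{nh\mathbb{C}[\hat f,\hat B_f]}{nh\mathbb{V}[\hat f]} = \rho^{1+k}\Omega_1,$$

where

$$\Omega_1 = -2\frac{\mu_{K,k}}{\nu_1(2)}\left\{\int f(x - uh)K(u)L^{(k)}(u\rho)du - b\int f(x - uh)K(u)du\int f(x - ub)L^{(k)}(u)du\right\}.$$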

Note $\nu_1(2) = \sigma^2_{us}$. If we did not include $\Omega_2$ in the Edgeworth expansion, i.e. we stopped at order $\rho^{1+k}$, then we could capture only the leading terms of $\Omega_1$, as follows, using that the kernel integrates to 1 and $\rho \to 0$,

$$\Omega_1 = -2\frac{\mu_{K,k}}{\nu_1(2)}\left\{\int f(x - uh)K(u)L^{(k)}(u\rho)du - b\int f(x - uh)K(u)du\int f(x - ub)L^{(k)}(u)du\right\} = -2\mu_{K,k}\vartheta_{K,2}^{-1}L^{(k)}(0)\{1 + o(1)\}.$$

Note that this matches the corresponding term in Hall (1992b). We do not do this, for completeness. There are no other terms of up to order $\rho^{1+2k}$, so capturing the full contribution of $\sigma_2^2/\sigma_1^2 - 1 = \sigma^2_{rbc}/\sigma^2_{us} - 1$ is natural and informative.

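Turning to $\Omega_2$, using the calculations in Section S.I.4.1 (recall $\tilde k = k \wedge S$), we find that

$$\mathbb{V}[\hat B_f] = \frac{h^{2k}\mu_{K,k}^2}{nb^{1+2k}}\left\{\mathbb{E}\left[b^{-1}L^{(k)}(X_{b,i})^2\right] - b^{-1}\left(\mathbb{E}\left[L^{(k)}(X_{b,i})\right]\right)^2\right\}$$

$$= \frac{h^{2k}\mu_{K,k}^2}{nb^{1+2k}}\left\{\int f(x - ub)L^{(k)}(u)^2du - b^{1+2\tilde k}\left(\int L^{(k-\tilde k)}(u)f^{(\tilde k)}(x - ub)du\right)^2\right\},$$

and hence

$$\frac{nh\mathbb{V}[\hat B_f]}{nh\mathbb{V}[\hat f]} = \rho^{1+2k}\Omega_2, \quad \text{where} \quad \Omega_2 = \frac{\mu_{K,k}^2}{\nu_1(2)}\left\{\int f(x - ub)L^{(k)}(u)^2du - b^{1+2\tilde k}\left(\int L^{(k-\tilde k)}(u)f^{(\tilde k)}(x - ub)du\right)^2\right\}.$$

The final piece will be $b^{1+2k}f^{(k)}(x)^2[1 + o(1)]$ if $k \le S$. Substituting this is permitted because $\rho^{1+2k}$ is the limit of the expansion, though it is not necessary to do, because this term is always higher order. Fully simplifying would yield

$$\Omega_2 = \mu_{K,k}^2\vartheta_{K,2}^{-1}\vartheta_{L^{(k)},2},$$

which can be used in Theorem 3.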


S.I.8 Complete Simulation Results 

To illustrate the gains from robust bias correction we conduct a Monte Carlo study to compare undersmoothing, traditional bias correction, and robust bias correction in terms of coverage accuracy and interval length using several data-driven procedures to select the bandwidth. We generate $n = 500$ observations from a true density $f$, evaluated at $x \in \{-2, -1, 0, 1, 2\}$. For the density, we consider:

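Model 1 (Gaussian Density): $x \sim \mathsf{N}(0,1)$

Model 2 (Skewed Unimodal Density): $x \sim \frac{1}{5}\mathsf{N}(0,1) + \frac{1}{5}\mathsf{N}\left(\frac{1}{2}, \left(\frac{2}{3}\right)^2\right) + \frac{3}{5}\mathsf{N}\left(\frac{13}{12}, \left(\frac{5}{9}\right)^2\right)$

Model 3 (Bimodal Density): $x \sim \frac{1}{2}\mathsf{N}\left(-1, \left(\frac{2}{3}\right)^2\right) + \frac{1}{2}\mathsf{N}\left(1, \left(\frac{2}{3}\right)^2\right)$

Model 4 (Asymmetric Bimodal Density): $x \sim \frac{3}{4}\mathsf{N}(0,1) + \frac{1}{4}\mathsf{N}\left(\frac{3}{2}, \left(\frac{1}{3}\right)^2\right)$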
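For readers who wish to reproduce the designs, the four densities can be written as Normal mixtures. The sketch below is illustrative only: the mixture weights, means, and standard deviations are taken to be the standard Marron and Wand (1992) parameterizations of the named densities, which is an assumption of this example, and the function names are hypothetical.

```python
import math
import random

# Each model is a list of (weight, mean, standard deviation) components.
# Parameter values are assumed to follow Marron and Wand (1992).
MODELS = {
    1: [(1.0, 0.0, 1.0)],
    2: [(1 / 5, 0.0, 1.0), (1 / 5, 1 / 2, 2 / 3), (3 / 5, 13 / 12, 5 / 9)],
    3: [(1 / 2, -1.0, 2 / 3), (1 / 2, 1.0, 2 / 3)],
    4: [(3 / 4, 0.0, 1.0), (1 / 4, 3 / 2, 1 / 3)],
}


def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd^2) random variable at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def density(x, model):
    """True mixture density of the given model at x."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in MODELS[model])


def sample(model, n, rng):
    """Draw n observations: pick a component by weight, then draw from it."""
    components = MODELS[model]
    weights = [w for w, _, _ in components]
    draws = []
    for _ in range(n):
        _, mean, sd = rng.choices(components, weights=weights, k=1)[0]
        draws.append(rng.gauss(mean, sd))
    return draws
```

A single simulated data set of the size used here would be `sample(2, 500, random.Random(0))`, with `density(x, 2)` giving the target value at each evaluation point.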

These models were previously analyzed in Marron and Wand (1992). They are plotted in Figure S.I.1. In this simulation study we compare the performance of the confidence intervals defined by $T_{us}$, $T_{bc}$, and $T_{rbc}$. For $T_{us}$, we take $K$ to be the Epanechnikov kernel, while bias correction uses the Epanechnikov and MSE-optimal kernels for $K$ and $L^{(2)}$, respectively. We consider two main data-driven bandwidth selectors. First, a Silverman rule-of-thumb alternative $\hat h_{rot} = \hat\sigma\,2.34\,n^{-1/(2k+1)}$. Second, the direct plug-in (DPI) for coverage error optimal bandwidth $\hat h_{dpi} = \hat H_{dpi}\,n^{-1/(k+3)}$, where $\hat H_{dpi}$ uses $\hat h_{rot}$ as a pilot bandwidth to estimate $f^{(k+2)}$ consistently. We also include the infeasible, population value for $h_{mse}$.
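The rule-of-thumb selector is simple enough to sketch directly. The code below assumes the form $\hat\sigma \cdot 2.34 \cdot n^{-1/(2k+1)}$ with $k$ the kernel order, and uses the sample standard deviation for $\hat\sigma$ (the text does not say which scale estimate is used, so that choice is an assumption); the function names are hypothetical. The direct plug-in selector is not sketched because its constant requires a pilot estimate of a higher derivative.

```python
import math


def rot_bandwidth(sigma_hat, n, k=2, constant=2.34):
    """Rule-of-thumb bandwidth sigma_hat * constant * n^(-1/(2k+1))."""
    return sigma_hat * constant * n ** (-1.0 / (2 * k + 1))


def rot_bandwidth_from_sample(sample, k=2, constant=2.34):
    """Same rule, with sigma_hat taken to be the sample standard deviation."""
    n = len(sample)
    mean = sum(sample) / n
    sigma_hat = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return rot_bandwidth(sigma_hat, n, k=k, constant=constant)
```

With `sigma_hat = 1`, `n = 500`, and `k = 2` the rule gives a bandwidth of about 0.675, in line with the average rule-of-thumb bandwidth reported for the Gaussian design in Table S.I.2.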

Empirical coverage and length are reported in Tables S.I.2-S.I.5 (Panel A) using our two proposed data-driven bandwidth selectors, as well as the infeasible $h_{mse}$. The most obvious finding is that robust bias correction has accurate coverage for all bandwidth choices in all models. The intervals are generally longer than for undersmoothing, but neither undersmoothing nor traditional bias correction yields correct coverage outside of a few special cases (e.g., undersmoothing at the infeasible MSE-optimal bandwidth in Model 4). The DPI bandwidth selector generally results in slightly smaller bandwidths (on average). Summary statistics for the two fully data-driven bandwidths are shown in Panel B. The fact that the DPI bandwidth is slightly smaller is borne out. It is also, in general, more variable.
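As a rough guide to how one entry of these tables is produced, the sketch below computes the density estimate and the Normal-based interval for the uncorrected estimator with the Epanechnikov kernel, Studentizing with the sample analogue of the variance of the kernel evaluations. It covers only the undersmoothing-type interval; the bias-corrected versions need the second kernel and are not reproduced here. Function names and the default critical value 1.96 are choices made for this example.

```python
import math


def epanechnikov(u):
    """Epanechnikov kernel, 0.75 (1 - u^2) on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def us_interval(x, data, h, z=1.96):
    """Return (f_hat, lower, upper) for the uncorrected kernel density estimator.

    sigma_hat^2 = (1/h) * { mean(K^2) - mean(K)^2 }, and the interval is
    f_hat -/+ z * sigma_hat / sqrt(n h).
    """
    n = len(data)
    kvals = [epanechnikov((x - xi) / h) for xi in data]
    m1 = sum(kvals) / n
    m2 = sum(v * v for v in kvals) / n
    f_hat = m1 / h
    sigma2_hat = (m2 - m1 * m1) / h
    half_width = z * math.sqrt(sigma2_hat / (n * h))
    return f_hat, f_hat - half_width, f_hat + half_width
```

Repeating this over many simulated samples and recording how often the interval contains the true density value gives the empirical coverage, and averaging `upper - lower` gives the interval length.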

To illustrate the robustness to tuning parameter selection, Figures S.I.2-S.I.9 show coverage and length for all four models. The dotted vertical line shows the population MSE-optimal bandwidth for reference. These figures demonstrate the delicate balance required for undersmoothing to provide correct coverage, whereas for a wide range of bandwidths robust bias correction provides correct coverage. Further, interval length is not unduly inflated for bandwidths that provide correct coverage. Recall that robust bias correction can accommodate, and will optimally employ, a larger bandwidth, yielding higher precision. Further emphasizing the point of robustness, we depart from $\rho = 1$ in Figures S.I.10 and S.I.11 to show coverage and length over a grid of $h$ and $\rho$.

The simulation results for local polynomial regression reported in Section S.II.7 below bear out these same conclusions and study these issues in more detail, in particular interval length. All our methods are implemented in software available from the authors' websites and via the R package nprobust available at https://cran.r-project.org/package=nprobust.
Figure S.I.1: Density Functions. Panels: (a) Model 1, (b) Model 2, (c) Model 3, (d) Model 4.
Table S.I.2: Simulation Results for Model 1

Panel A: Empirical Coverage and Average Interval Length of 95% Confidence Intervals

                              Empirical Coverage      Interval Length
x     Selector   Bandwidth    US      BC      RBC     US       RBC
-2    h_mse      0.677        91.5    81.1    93.9    0.039    0.055
      h_rot      0.674        90.6    81.2    93.9    0.039    0.055
      h_dpi      0.709        86.7    80.7    93.7    0.038    0.054
-1    h_mse      0.677        94.5    78.0    94.3    0.069    0.109
      h_rot      0.674        94.5    78.0    94.0    0.069    0.109
      h_dpi      0.748        93.2    80.0    94.7    0.067    0.106
 0    h_mse      0.677        85.5    74.8    95.0    0.078    0.132
      h_rot      0.674        84.4    74.8    94.9    0.078    0.132
      h_dpi      0.448        91.1    77.9    95.0    0.109    0.172
 1    h_mse      0.677        94.9    78.3    94.5    0.069    0.109
      h_rot      0.674        94.7    78.3    94.6    0.069    0.109
      h_dpi      0.751        93.8    80.2    95.1    0.066    0.105
 2    h_mse      0.677        92.0    83.2    94.3    0.038    0.055
      h_rot      0.674        91.2    83.2    94.3    0.039    0.055
      h_dpi      0.707        87.7    82.7    94.2    0.038    0.054

Panel B: Summary Statistics for the Estimated Bandwidths

x     Selector   Pop. Par.   Min.     1st Qu.   Median   Mean     3rd Qu.   Max.     Std. Dev.
-2    h_rot      0.677       0.5962   0.6597    0.6746   0.6744   0.6888    0.7503   0.021
      h_dpi      -           0.5597   0.6492    0.6856   0.7093   0.7354    2.394    0.11
-1    h_rot      0.677       0.5962   0.6597    0.6746   0.6744   0.6888    0.7503   0.021
      h_dpi      -           0.406    0.6196    0.7044   0.7484   0.8191    2.885    0.22
 0    h_rot      0.677       0.5962   0.6597    0.6746   0.6744   0.6888    0.7503   0.021
      h_dpi      -           0.3499   0.4084    0.4324   0.4478   0.4638    2.885    0.084
 1    h_rot      0.677       0.5962   0.6597    0.6746   0.6744   0.6888    0.7503   0.021
      h_dpi      -           0.4099   0.6241    0.7097   0.751    0.8227    2.885    0.21
 2    h_rot      0.677       0.5962   0.6597    0.6746   0.6744   0.6888    0.7503   0.021
      h_dpi      -           0.553    0.6506    0.6864   0.7071   0.7365    1.629    0.095

Notes:

(i) US = Undersmoothing, BC = Bias Corrected, RBC = Robust Bias Corrected.

(ii) Columns under “Bandwidth” report the average estimated bandwidth choices, as appropriate, for bandwidth $h_n$.



Table S.I.3: Simulation Results for Model 2

Panel A: Empirical Coverage and Average Interval Length of 95% Confidence Intervals

                              Empirical Coverage      Interval Length
x     Selector   Bandwidth    US      BC      RBC     US       RBC
-2    h_mse      0.454        90.5    79.1    85.5    0.021    0.028
      h_rot      0.551        91.7    79.3    87.6    0.019    0.026
      h_dpi      0.741        92.0    82.6    89.8    0.018    0.025
-1    h_mse      0.454        94.4    80.4    92.9    0.048    0.069
      h_rot      0.551        94.1    80.7    93.1    0.043    0.063
      h_dpi      0.684        89.0    81.2    94.0    0.041    0.059
 0    h_mse      0.454        94.9    81.0    95.1    0.089    0.135
      h_rot      0.551        93.0    80.9    95.0    0.079    0.122
      h_dpi      0.503        92.2    80.6    94.6    0.084    0.128
 1    h_mse      0.454        83.8    76.1    94.9    0.115    0.193
      h_rot      0.551        62.2    74.1    94.4    0.097    0.169
      h_dpi      0.311        91.7    77.0    94.4    0.154    0.244
 2    h_mse      0.454        90.6    82.1    93.8    0.071    0.104
      h_rot      0.551        79.4    81.8    94.4    0.064    0.095
      h_dpi      0.466        87.3    81.5    94.0    0.070    0.102

Panel B: Summary Statistics for the Estimated Bandwidths

x     Selector   Pop. Par.   Min.     1st Qu.   Median   Mean     3rd Qu.   Max.     Std. Dev.
-2    h_rot      0.454       0.477    0.5365    0.5504   0.5506   0.5649    0.6424   0.021
      h_dpi      -           0.411    0.5546    0.6643   0.741    0.8488    2.885    0.27
-1    h_rot      0.454       0.477    0.5365    0.5504   0.5506   0.5649    0.6424   0.021
      h_dpi      -           0.3724   0.5253    0.6442   0.6837   0.7681    2.885    0.23
 0    h_rot      0.454       0.477    0.5365    0.5504   0.5506   0.5649    0.6424   0.021
      h_dpi      -           0.3902   0.4693    0.4934   0.5027   0.5239    1.491    0.058
 1    h_rot      0.454       0.477    0.5365    0.5504   0.5506   0.5649    0.6424   0.021
      h_dpi      -           0.2545   0.2989    0.3093   0.3106   0.321     0.4302   0.017
 2    h_rot      0.454       0.477    0.5365    0.5504   0.5506   0.5649    0.6424   0.021
      h_dpi      -           0.3955   0.4474    0.4637   0.4661   0.4826    0.7111   0.028

Notes:

(i) US = Undersmoothing, BC = Bias Corrected, RBC = Robust Bias Corrected.

(ii) Columns under “Bandwidth” report the average estimated bandwidth choices, as appropriate, for bandwidth $h_n$.



Table S.I.4: Simulation Results for Model 3

Panel A: Empirical Coverage and Average Interval Length of 95% Confidence Intervals

                              Empirical Coverage      Interval Length
x     Selector   Bandwidth    US      BC      RBC     US       RBC
-2    h_mse      0.533        93.4    81.8    94.1    0.056    0.083
      h_rot      0.811        77.4    81.4    94.3    0.045    0.067
      h_dpi      0.839        72.1    78.0    92.7    0.045    0.066
-1    h_mse      0.533        87.3    77.9    94.6    0.087    0.136
      h_rot      0.811        45.6    75.0    94.1    0.064    0.105
      h_dpi      0.467        88.9    78.7    94.4    0.095    0.147
 0    h_mse      0.533        89.8    79.8    94.3    0.076    0.114
      h_rot      0.811        52.0    78.0    94.9    0.058    0.092
      h_dpi      0.646        79.6    78.2    94.5    0.067    0.103
 1    h_mse      0.533        87.0    78.0    94.3    0.087    0.136
      h_rot      0.811        47.8    74.9    94.1    0.064    0.105
      h_dpi      0.467        88.9    78.6    94.3    0.095    0.147
 2    h_mse      0.533        93.5    80.4    93.6    0.056    0.082
      h_rot      0.811        77.4    80.8    94.6    0.045    0.067
      h_dpi      0.839        72.7    77.4    92.3    0.045    0.066

Panel B: Summary Statistics for the Estimated Bandwidths

x     Selector   Pop. Par.   Min.     1st Qu.   Median   Mean     3rd Qu.   Max.     Std. Dev.
-2    h_rot      0.533       0.7394   0.7991    0.8112   0.8113   0.824     0.8851   0.019
      h_dpi      -           0.5429   0.7482    0.8023   0.839    0.8828    2.885    0.15
-1    h_rot      0.533       0.7394   0.7991    0.8112   0.8113   0.824     0.8851   0.019
      h_dpi      -           0.4027   0.4501    0.4646   0.4673   0.4813    0.6241   0.025
 0    h_rot      0.533       0.7394   0.7991    0.8112   0.8113   0.824     0.8851   0.019
      h_dpi      -           0.5623   0.6236    0.642    0.6465   0.6644    0.8991   0.034
 1    h_rot      0.533       0.7394   0.7991    0.8112   0.8113   0.824     0.8851   0.019
      h_dpi      -           0.4      0.4497    0.4643   0.467    0.4814    0.7607   0.025
 2    h_rot      0.533       0.7394   0.7991    0.8112   0.8113   0.824     0.8851   0.019
      h_dpi      -           0.6142   0.7495    0.8      0.8391   0.878     2.885    0.16

Notes:

(i) US = Undersmoothing, BC = Bias Corrected, RBC = Robust Bias Corrected.

(ii) Columns under “Bandwidth” report the average estimated bandwidth choices, as appropriate, for bandwidth $h_n$.



Tabic S.I.5: Simulations Results for Model 4 


Panel A: Empirical Coverage and Average Interval Length of 95% Confidence Intervals

                  Bandwidth   Empirical Coverage        Interval Length
                              US      BC      RBC       US       RBC
x = -2   h_mse    0.395       94.3    81.4    92.5      0.043    0.062
         h_rot    0.739       90.3    82.2    93.9      0.032    0.046
         h_dpi    0.766       87.0    82.0    93.7      0.032    0.045
x = -1   h_mse    0.395       94.4    80.1    94.1      0.086    0.128
         h_rot    0.739       94.6    79.2    94.2      0.059    0.091
         h_dpi    0.745       93.6    79.9    95.1      0.062    0.095
x = 0    h_mse    0.395       94.1    79.4    94.8      0.105    0.161
         h_rot    0.739       85.2    77.0    95.0      0.069    0.112
         h_dpi    0.785       66.3    76.5    95.1      0.069    0.112
x = 1    h_mse    0.395       93.2    79.6    95.3      0.104    0.158
         h_rot    0.739       67.0    73.2    93.6      0.068    0.112
         h_dpi    0.590       89.9    82.7    96.9      0.083    0.131
x = 2    h_mse    0.395       90.7    82.4    94.2      0.079    0.115
         h_rot    0.739       33.8    74.0    91.9      0.057    0.085
         h_dpi    0.762       33.5    75.3    92.0      0.058    0.087


Panel B: Summary Statistics for the Estimated Bandwidths

                  Pop. Par.   Min.     1st Qu.  Median   Mean     3rd Qu.  Max.     Std. Dev.
x = -2   h_rot    0.395       0.6624   0.7261   0.7396   0.7395   0.7526   0.8147   0.02
         h_dpi    -           0.5778   0.7103   0.7466   0.7664   0.7949   2.885    0.11
x = -1   h_rot    0.395       0.6624   0.7261   0.7396   0.7395   0.7526   0.8147   0.02
         h_dpi    -           0.4469   0.5654   0.6646   0.7448   0.8531   2.885    0.26
x = 0    h_rot    0.395       0.6624   0.7261   0.7396   0.7395   0.7526   0.8147   0.02
         h_dpi    -           0.414    0.6018   0.7431   0.7855   0.8893   2.885    0.26
x = 1    h_rot    0.395       0.6624   0.7261   0.7396   0.7395   0.7526   0.8147   0.02
         h_dpi    -           0.4047   0.4915   0.5302   0.5896   0.6039   2.636    0.18
x = 2    h_rot    0.395       0.6624   0.7261   0.7396   0.7395   0.7526   0.8147   0.02
         h_dpi    -           0.4532   0.5822   0.6896   0.7617   0.8656   2.885    0.25

Notes: 

(i) US = Undersmoothing, BC = Bias Corrected, RBC = Robust Bias Corrected.

(ii) Columns under “Bandwidth” report the average estimated bandwidth choices, as appropriate, for bandwidth $h_n$.



Figure S.I.2: Empirical Coverage of 95% Confidence Intervals - Model 1
Figure S.I.3: Empirical Coverage of 95% Confidence Intervals - Model 2
Figure S.I.4: Empirical Coverage of 95% Confidence Intervals - Model 3
Figure S.I.5: Empirical Coverage of 95% Confidence Intervals - Model 4
Figure S.I.7: Average Interval Length of 95% Confidence Intervals - Model 2
Figure S.I.8: Average Interval Length of 95% Confidence Intervals - Model 3
Figure S.I.10: Empirical Coverage of 95% Confidence Intervals (x = 0)

(a) Model 1   (b) Model 2   (c) Model 3   (d) Model 4

Figure S.I.11: Average Interval Length of 95% Confidence Intervals (x = 0)

(a) Model 1   (b) Model 2   (c) Model 3   (d) Model 4


Part S.II

Local Polynomial Estimation and Inference

S.II.1 Notation

Local polynomial regression is notationally demanding, and the Edgeworth expansions will be 
substantially more so. For ease of reference, we collect all notation here regardless of where it 
is introduced and used. Much of the notation is fully restated later, when needed. As such, 
this subsection is designed more for reference, and is not easily readable. 

Throughout, a subscript $p$ will generally refer to a quantity used to estimate $m(x) = E[Y_i | X_i = x]$, while a subscript $q$ will refer to the bias correction portion (the vectors $e_0$ and $e_{p+1}$ below are notable exceptions to this rule). Recall that $p \geq 1$ is odd and $q > p$ may be even or odd.

Throughout this section let $X_{h,i} = (X_i - x)/h$ and similarly for $X_{b,i}$. The evaluation point is implicit here.

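To save notation, products of functions will be written together, with only one argument. For example,

\[
(K r_p r_p')(X_{h,i}) := K(X_{h,i})\, r_p(X_{h,i})\, r_p(X_{h,i})' = K\!\left(\frac{X_i - x}{h}\right) r_p\!\left(\frac{X_i - x}{h}\right) r_p\!\left(\frac{X_i - x}{h}\right)',
\]

and similarly for $(K r_p)(X_{h,i})$, $(L r_q)(X_{b,i})$, etc.

All expectations are fixed-$n$ calculations. To give concrete examples of this notation ($\Lambda_{p,k}$, $R_p$, and $W_p$ are redefined below):

\[
\Lambda_{p,k} = R_p' W_p \big[ ((X_1 - x)/h)^{p+k}, \ldots, ((X_n - x)/h)^{p+k} \big]'/n = \frac{1}{nh} \sum_{i=1}^n (K r_p)(X_{h,i})\, X_{h,i}^{p+k}
\]

and

\[
\tilde{\Lambda}_{p,k} = E[\Lambda_{p,k}] = h^{-1} E\big[(K r_p)(X_{h,i})\, X_{h,i}^{p+k}\big] = h^{-1} \int_{\mathrm{supp}\{X\}} K\!\left(\frac{X_i - x}{h}\right) r_p\!\left(\frac{X_i - x}{h}\right) \left(\frac{X_i - x}{h}\right)^{p+k} f(X_i)\, dX_i.
\]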

Here the range of integration is explicit, but in general it will not be. This is important for boundary issues, where the notation is generally unchanged, and it is to be understood that moments and moments of the kernel be replaced by the appropriate truncated version. Continuing this example, if $\mathrm{supp}\{X\} = [0, \infty)$ and $x = 0$, then by a change of variables

\[
\tilde{\Lambda}_{p,k} = h^{-1} \int_{\mathrm{supp}\{X\}} (K r_p)(X_{h,i})\, X_{h,i}^{p+k} f(X_i)\, dX_i = \int_0^{\infty} (K r_p)(u)\, u^{p+k} f(-uh)\, du,
\]

whereas if $\mathrm{supp}\{X\} = (-\infty, 0]$ and $x = 0$, then

\[
\tilde{\Lambda}_{p,k} = \int_{-\infty}^{0} (K r_p)(u)\, u^{p+k} f(-uh)\, du.
\]

For the remainder of this section, the notation is left generic.

For the proofs (Section S.II.6) we will frequently abbreviate $s = \sqrt{nh}$.


S.II.1.1 Estimators, Variances, and Studentized Statistics

To define the estimator $\hat{m}$ of $m$ and the bias correction, begin by defining:

\[
\begin{aligned}
r_p(u) &= (1, u, u^2, \ldots, u^p)', & R_p &= [r_p(X_{h,1}), \ldots, r_p(X_{h,n})]', \\
W_p &= \mathrm{diag}\big(h^{-1} K(X_{h,i}) : i = 1, \ldots, n\big), & H_p &= \mathrm{diag}\big(1, h^{-1}, h^{-2}, \ldots, h^{-p}\big), \\
\Gamma_p &= R_p' W_p R_p / n, \quad \text{and} & \Lambda_{p,k} &= R_p' W_p \big[X_{h,1}^{p+k}, \ldots, X_{h,n}^{p+k}\big]'/n,
\end{aligned}
\tag{17}
\]

where $\mathrm{diag}(a_i : i = 1, \ldots, n)$ denotes the $n \times n$ diagonal matrix constructed using the elements $a_1, a_2, \ldots, a_n$. Note that in the main text $\Lambda_{p,1}$ is denoted by $\Lambda_p$.

Similarly, define

\[
\begin{aligned}
r_q(u) &= (1, u, u^2, \ldots, u^q)', & R_q &= [r_q(X_{b,1}), \ldots, r_q(X_{b,n})]', \\
W_q &= \mathrm{diag}\big(b^{-1} L(X_{b,i}) : i = 1, \ldots, n\big), & H_q &= \mathrm{diag}\big(1, b^{-1}, b^{-2}, \ldots, b^{-q}\big), \\
\Gamma_q &= R_q' W_q R_q / n, \quad \text{and} & \Lambda_{q,k} &= R_q' W_q \big[X_{b,1}^{q+k}, \ldots, X_{b,n}^{q+k}\big]'/n.
\end{aligned}
\tag{18}
\]

These are identical, but substituting $q$, $L$, and $b$ in place of $p$, $K$, and $h$, respectively. Note that some dimensions change but others do not: for example, $W_p$ and $W_q$ are both $n \times n$, but $\Gamma_p$ is $(p+1)$ square whereas $\Gamma_q$ is $(q+1)$.

Denote by $e_0$ the $(p+1)$-vector with a one in the first position and zeros in the remaining, and $Y = (Y_1, \ldots, Y_n)'$. The local polynomial estimator of $m(x) = E[Y_i | X_i = x]$ is

\[
\hat{m} = e_0' \hat{\beta}_p = e_0' H_p \Gamma_p^{-1} R_p' W_p Y / n,
\]

where

\[
\hat{\beta}_p = \arg\min_{b \in \mathbb{R}^{p+1}} \frac{1}{nh} \sum_{i=1}^n \big(Y_i - r_p(X_i - x)'b\big)^2 K(X_{h,i}) = H_p \Gamma_p^{-1} R_p' W_p Y / n.
\]

If we define $R = [r_p(X_1 - x), \ldots, r_p(X_n - x)]'$ and $M = [m(X_1), \ldots, m(X_n)]'$, then we can split $\hat{m} - m$ into the variance and bias terms

\[
\hat{m} - m = e_0' \Gamma_p^{-1} R_p' W_p (Y - M)/n + e_0' \Gamma_p^{-1} R_p' W_p (M - R\beta_p)/n.
\]

This will be useful in the course of the proofs. 
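As an illustration of the estimator just defined, the following is a minimal NumPy sketch that builds $R_p$, $W_p$, and $\Gamma_p$ and returns $\hat{m} = e_0'\hat{\beta}_p$. The function names and the Epanechnikov kernel are assumptions made for the example; they are not taken from the paper or from the nprobust package.

```python
import numpy as np

def epanechnikov(u):
    """Illustrative kernel choice (an assumption, not prescribed by the text)."""
    u = np.asarray(u, dtype=float)
    return 0.75 * np.clip(1.0 - u ** 2, 0.0, None)

def local_poly_fit(x_obs, y_obs, x0, h, p=1, kernel=epanechnikov):
    """Return (m_hat, beta_hat) for a degree-p local polynomial fit at x0."""
    x_obs = np.asarray(x_obs, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    n = x_obs.size
    xh = (x_obs - x0) / h                       # X_{h,i}
    R = np.vander(xh, p + 1, increasing=True)   # rows are r_p(X_{h,i})'
    RW = R.T * (kernel(xh) / h)                 # R_p' W_p
    gamma = RW @ R / n                          # Gamma_p
    coef = np.linalg.solve(gamma, RW @ y_obs / n)
    # H_p = diag(1, h^-1, ..., h^-p) rescales to derivatives in original units
    H = float(h) ** (-np.arange(p + 1, dtype=float))
    beta_hat = H * coef
    return beta_hat[0], beta_hat
```

For data that are exactly linear, the local linear fit ($p = 1$) reproduces the regression function and its slope at any evaluation point with enough observations inside the kernel window.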

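The conditional bias is given by

\[
E[\hat{m} | X_1, \ldots, X_n] - m = h^{p+1} \frac{m^{(p+1)}}{(p+1)!} e_0' \Gamma_p^{-1} \Lambda_{p,1} + o_P(h^{p+1}). \tag{19}
\]

(Recall that in the main paper, $\Lambda_{p,1}$ is denoted $\Lambda_p$.) This result is valid for $p$ odd, our main focus, but also for $p$ even at boundary points.

Denote by $e_{p+1}$ the $(q+1)$-vector with one in the $p+2$ position, and zeros in the rest. Then we estimate the bias as

\[
\hat{B}_m = h^{p+1} \frac{\hat{m}^{(p+1)}}{(p+1)!} e_0' \Gamma_p^{-1} \Lambda_{p,1}, \qquad \text{where} \qquad \hat{m}^{(p+1)} = [(p+1)!]\, e_{p+1}' H_q \Gamma_q^{-1} R_q' W_q Y / n.
\]

The bias corrected estimator can then be written

\[
\begin{aligned}
\hat{m} - \hat{B}_m &= e_0' \Gamma_p^{-1} R_p' W_p Y/n - h^{p+1} e_0' \Gamma_p^{-1} \Lambda_{p,1} e_{p+1}' H_q \Gamma_q^{-1} R_q' W_q Y/n \\
&= e_0' \Gamma_p^{-1} \big( R_p' W_p - \rho^{p+1} \Lambda_{p,1} e_{p+1}' \Gamma_q^{-1} R_q' W_q \big) Y/n,
\end{aligned}
\]

using the fact that $e_{p+1}' H_q = b^{-(p+1)} e_{p+1}'$, with $\rho = h/b$.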

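The fixed-$n$ variances are

\[
\sigma^2_{us} := (nh) V[\hat{m} | X_1, \ldots, X_n] = e_0' \Gamma_p^{-1} \big( h R_p' W_p \Sigma W_p R_p / n \big) \Gamma_p^{-1} e_0 \tag{20}
\]

and

\[
\begin{aligned}
\sigma^2_{rbc} &:= (nh) V[\hat{m} - \hat{B}_m | X_1, \ldots, X_n] \\
&= e_0' \Gamma_p^{-1} (h/n) \big( R_p' W_p - \rho^{p+1} \Lambda_{p,1} e_{p+1}' \Gamma_q^{-1} R_q' W_q \big) \Sigma \big( R_p' W_p - \rho^{p+1} \Lambda_{p,1} e_{p+1}' \Gamma_q^{-1} R_q' W_q \big)' \Gamma_p^{-1} e_0,
\end{aligned}
\tag{21}
\]

where

\[
\Sigma = \mathrm{diag}(v(X_i) : i = 1, \ldots, n), \qquad \text{with } v(x) = V[Y | X = x].
\]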


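These are the closest analogue to the density case, but are still random due to the conditioning on the covariates. Their respective estimators are

\[
\hat{\sigma}^2_{us} = e_0' \Gamma_p^{-1} \big( h R_p' W_p \hat{\Sigma}_p W_p R_p / n \big) \Gamma_p^{-1} e_0
\]

and

\[
\hat{\sigma}^2_{rbc} = e_0' \Gamma_p^{-1} (h/n) \big( R_p' W_p - \rho^{p+1} \Lambda_{p,1} e_{p+1}' \Gamma_q^{-1} R_q' W_q \big) \hat{\Sigma}_q \big( R_p' W_p - \rho^{p+1} \Lambda_{p,1} e_{p+1}' \Gamma_q^{-1} R_q' W_q \big)' \Gamma_p^{-1} e_0.
\]

The conditional variance matrices are estimated as

\[
\hat{\Sigma}_p = \mathrm{diag}(\hat{v}(X_i) : i = 1, \ldots, n), \qquad \text{with } \hat{v}(X_i) = \big(Y_i - r_p(X_i - x)'\hat{\beta}_p\big)^2,
\]

and

\[
\hat{\Sigma}_q = \mathrm{diag}(\hat{v}(X_i) : i = 1, \ldots, n), \qquad \text{with } \hat{v}(X_i) = \big(Y_i - r_q(X_i - x)'\hat{\beta}_q\big)^2.
\]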

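The Studentized statistics of interest are then:

\[
T_{us} = \frac{\sqrt{nh}(\hat{m} - m)}{\hat{\sigma}_{us}}, \qquad
T_{bc} = \frac{\sqrt{nh}(\hat{m} - \hat{B}_m - m)}{\hat{\sigma}_{us}}, \qquad
T_{rbc} = \frac{\sqrt{nh}(\hat{m} - \hat{B}_m - m)}{\hat{\sigma}_{rbc}}.
\]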

The main result of this section is an Edgeworth expansion of the distribution function of these 
statistics. 
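To make the construction of $T_{rbc}$ concrete, here is a hedged NumPy sketch that computes $\hat{m} - \hat{B}_m$, the variance estimate $\hat{\sigma}^2_{rbc}$ with residuals from the order-$q$ fit, and the resulting interval $\hat{m} - \hat{B}_m \pm z\,\hat{\sigma}_{rbc}/\sqrt{nh}$. The function names, the default Epanechnikov kernel, and the default critical value are illustrative assumptions; this is a sketch of the formulas above, not the authors' software.

```python
import numpy as np

def epanechnikov(u):
    """Illustrative kernel choice (an assumption)."""
    return 0.75 * np.clip(1.0 - np.asarray(u, dtype=float) ** 2, 0.0, None)

def _weighted_design(x, x0, bw, order, kernel):
    u = (x - x0) / bw
    R = np.vander(u, order + 1, increasing=True)   # rows r(u_i)'
    RW = R.T * (kernel(u) / bw)                    # R' W
    return u, R, RW

def rbc_interval(x, y, x0, h, b, p=1, q=2, z=1.96, kernel=epanechnikov):
    """Robust bias-corrected point estimate, sigma^2_rbc estimate, and interval."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    rho = h / b
    uh, Rp, RpW = _weighted_design(x, x0, h, p, kernel)
    _, Rq, RqW = _weighted_design(x, x0, b, q, kernel)
    gamma_p = RpW @ Rp / n
    gamma_q = RqW @ Rq / n
    lam_p1 = (RpW @ uh ** (p + 1)) / n             # Lambda_{p,1}
    e0 = np.zeros(p + 1); e0[0] = 1.0
    ep1 = np.zeros(q + 1); ep1[p + 1] = 1.0
    g0 = np.linalg.solve(gamma_p, e0)              # Gamma_p^{-1} e_0 (Gamma_p symmetric)
    gq = np.linalg.solve(gamma_q, ep1)             # Gamma_q^{-1} e_{p+1}
    # R_p'W_p - rho^{p+1} Lambda_{p,1} e_{p+1}' Gamma_q^{-1} R_q'W_q
    A = RpW - rho ** (p + 1) * np.outer(lam_p1, gq) @ RqW
    weights = g0 @ A / n                           # m_hat - B_hat = weights @ y
    m_rbc = weights @ y
    # residuals from the order-q fit with bandwidth b
    resid = y - Rq @ np.linalg.solve(gamma_q, RqW @ y / n)
    sigma2_rbc = n * h * np.sum(weights ** 2 * resid ** 2)
    half = z * np.sqrt(sigma2_rbc / (n * h))
    return m_rbc, sigma2_rbc, (m_rbc - half, m_rbc + half)
```

With noiseless quadratic data, $p = 1$ and $q = 2$, the bias estimate removes the local linear smoothing bias exactly, so the corrected estimate recovers $m(x)$ and the variance estimate collapses to zero.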

S.II.1.2 Edgeworth Expansion Terms 

The terms of the Edgeworth expansion require further notation and discussion. The expressions are not nearly as compact as in the density case (cf. Section S.I.6).

Define the expectations of $\Gamma_p$, $\Gamma_q$, $\Lambda_{p,k}$, and $\Lambda_{q,k}$ as $\tilde{\Gamma}_p$, $\tilde{\Gamma}_q$, $\tilde{\Lambda}_{p,k}$, and $\tilde{\Lambda}_{q,k}$, such as

\[
\tilde{\Gamma}_p = E[\Gamma_p] = E\big[h^{-1} (K r_p r_p')(X_{h,i})\big].
\]

These will be used to define nonrandom biases and variances that appear in the expansions. The biases are defined in Eqn. (22), and are given by

\[
\begin{aligned}
\eta_{us} &= \sqrt{nh} \int e_0' \tilde{\Gamma}_p^{-1} K(u) r_p(u) \big( m(x - uh) - r_p(uh)'\beta_p \big) f(x - uh)\, du, \\
\eta_{bc} &= \sqrt{nh} \int e_0' \tilde{\Gamma}_p^{-1} K(u) r_p(u) \big( m(x - uh) - r_{p+1}(uh)'\beta_{p+1} \big) f(x - uh)\, du \\
&\quad - \sqrt{nh}\, \rho^{p+1} \int e_0' \tilde{\Gamma}_p^{-1} \tilde{\Lambda}_{p,1} e_{p+1}' \tilde{\Gamma}_q^{-1} L(u) r_q(u) \big( m(x - ub) - r_q(ub)'\beta_q \big) f(x - ub)\, du.
\end{aligned}
\]


Further discussion and leading terms are found in Section S.II.4. 

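The fixed-$n$ variances are computed conditionally, and we must replace them with their nonrandom analogues (just as $\eta_{us}$ and $\eta_{bc}$ must be nonrandom). Recalling Equations (20) and (21), define

\[
\tilde{\sigma}^2_{us} := e_0' \tilde{\Gamma}_p^{-1} \tilde{\Psi}_p \tilde{\Gamma}_p^{-1} e_0,
\qquad \text{where} \qquad
\tilde{\Psi}_p = E[\check{\Psi}_p] \quad \text{and} \quad \check{\Psi}_p := h R_p' W_p \Sigma W_p R_p / n,
\]

and

\[
\tilde{\sigma}^2_{rbc} := e_0' \tilde{\Gamma}_p^{-1} \tilde{\Psi}_q \tilde{\Gamma}_p^{-1} e_0,
\]

where

\[
\tilde{\Psi}_q = E[\check{\Psi}_q] \quad \text{and} \quad \check{\Psi}_q := h \big( R_p' W_p - \rho^{p+1} \tilde{\Lambda}_{p,1} e_{p+1}' \tilde{\Gamma}_q^{-1} R_q' W_q \big) \Sigma \big( R_p' W_p - \rho^{p+1} \tilde{\Lambda}_{p,1} e_{p+1}' \tilde{\Gamma}_q^{-1} R_q' W_q \big)' / n.
\]

In the course of the proofs, we will also use $\hat{\Psi}_p = h R_p' W_p \hat{\Sigma}_p W_p R_p / n$ and the analogously-defined $\hat{\Psi}_q$.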

We now give the precise forms of the polynomials in the Edgeworth expansion. As with the density, there will be both even and odd polynomials. These are not as compact or simple as the density case. Further, we will not attempt to simplify these functions by making use of limiting versions of moments. For example, we will not replace $\tilde{\Lambda}_{p,1}$ by $f(x) \int (K r_p)(u) u^{p+1}\, du$, and similarly for other pieces. The only simplification made will be the use of $q_{k,us}(z)$ in the expansion for $T_{bc}$, which otherwise would require further notation than what is below (along the lines of $p_{1,us}(z)$ below).

First, define the following functions, which depend on $n$, $p$, $q$, $h$, $b$, $K$ and $L$, but this is generally suppressed:


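\[
\begin{aligned}
\ell^0_{us}(X_i) &= e_0' \tilde{\Gamma}_p^{-1} (K r_p)(X_{h,i}); \\
\ell^0_{bc}(X_i) &= \ell^0_{us}(X_i) - \rho^{p+2} e_0' \tilde{\Gamma}_p^{-1} \tilde{\Lambda}_{p,1} e_{p+1}' \tilde{\Gamma}_q^{-1} (L r_q)(X_{b,i}); \\
\ell^1_{us}(X_i, X_j) &= e_0' \tilde{\Gamma}_p^{-1} \big( E[(K r_p r_p')(X_{h,j})] - (K r_p r_p')(X_{h,j}) \big) \tilde{\Gamma}_p^{-1} (K r_p)(X_{h,i});
\end{aligned}
\]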
4W, *i) = C(X,, X,) - / +2 eify{ (E[(/Y tyr ;)(A' ftJ )] - (Kr/ p )(X hJ )) 

+ ((irr p )(A iJ .)A0‘ - E[(A> p )(A' p j) A'0 1 ]) e ; +1 
+ Ap.iePif " 1 (E[(Lr 8 r;)(A w )] - 

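With this notation, we can write

\[
\begin{aligned}
\tilde{\sigma}^2_{us} &= E\big[h^{-1} \ell^0_{us}(X)^2 v(X)\big], \\
\tilde{\sigma}^2_{rbc} &= E\big[h^{-1} \ell^0_{bc}(X)^2 v(X)\big], \\
\eta_{us} &= s\, E\big[h^{-1} \ell^0_{us}(X_i) [m(X_i) - r_p(X_i - x)'\beta_p]\big],
\end{aligned}
\]

and

\[
\eta_{bc} = s\, E\Big[h^{-1} \ell^0_{us}(X_i) [m(X_i) - r_{p+1}(X_i - x)'\beta_{p+1}] - h^{-1} \big( \ell^0_{us}(X_i) - \ell^0_{bc}(X_i) \big) [m(X_i) - r_q(X_i - x)'\beta_q] \Big].
\]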

We will define the Edgeworth expansion polynomials first for the undersmoothing case. The standard Normal density is $\phi(z)$. First, the even polynomials are

Pl,us(*) = 0{ z )Ks E [ /r ^!sK 2 - 2 - 1 )/ 6 } 

and 

P3,us(-) = ~H Z )Ks- 

The absence of $p^{(2)}(z)$ is noteworthy: there is no version of this term for local polynomial estimation, because $\varepsilon_i$ is conditionally mean zero.

Next, the odd polynomials for undersmoothing are defined as follows:

9i,us(2) = 0(^)d~ 6 E [h^C^Xife^] 2 {z 3 / 3 + 7z/A + al s z{z 2 - 3)/4} 

+ ^z)a ~ 2 E [h-XsiXiK^X^ {-z(z 2 - 3)/2}
+ <f>(z)a r J'E [h'

-^)^"s 2 E \h 
-^)^”s 4]E h 
+ (t>(z)a~ 2 E 

+ </ > (-)^ us 4e 


-l/)0 


- 2«0 


+ 0(-)5-us 4e [ h ~ 

+ 0 (-)^us 4e |>" 
+ <t>W us 4]E [ /E 
+ </ > (-)^us 4E [ hr 


+ <f>(z)°u s E 


h- 


- V ^) 2 )] { z(* 2 - 3 )/ 8 } 

- 1 ^ s (X i ) 2 r p (X ftii )'f; 1 (Xr p )(X /l>i ) £l 2 ] {z(* 2 - 1)/2} 

E {z(z 2 - 1)} 

C (Xi) 2 {r p {X hi )'f ~ 1 (Kr p )(X h j )) 2 e 2 ] {z(z 2 - l)/4} 

(^O^C^O'r - 1 (/iTr*,,) (JfOr'pC^O'r - 1 (iifr p ) (JST fe , fe )^ (^r fc )ef 

x {^ 2 - l)/2} 

4 ^ s (^) M ] {--(- 2 - 3 )/ 24 } 

1 (CPQ)M^) - E Ks(^) 2 i'(A; : )]) C(Xi) 2 ef] {z(z 2 - 1)/4} 

2 C(^,^)C(^)C(^) 2 ^(^)] W* 2 - 3 )} 

2 C(^,^)4 S M Ks(^) 2 ^(^) - E [CWM^-)]) 4] {-4 

1 (£° s (Xi) 2 v(Xi) - E[^ s (A0 2 4^)]) 2 1 {-z(z 2 + 1)/8} ; 


\[ q_{2,us}(z) = -\phi(z)\, \tilde{\sigma}_{us}^{-2}\, z/2; \]

Q3,us(z) = 04)^us 4E [^ _1 C(^) 3£ i](4/3). 

For robust bias correction, both the even polynomials, $p_{1,rbc}(z)$ and $p_{3,rbc}(z)$, and the odd polynomials, $q_{1,rbc}(z)$, $q_{2,rbc}(z)$, and $q_{3,rbc}(z)$, are defined in the exact same way, but changing $\tilde{\sigma}_{us}$ to $\tilde{\sigma}_{rbc}$, $\ell_{us}(\cdot)$ to $\ell_{bc}(\cdot)$, $K$ to $L$, and $p$ to $q$, and so forth.

The polynomials defined here are for distribution function expansions, and are different from those used for coverage error. The polynomials $q_{1,us}$, $q_{2,us}$, and $q_{3,us}$ and $q_{1,rbc}$, $q_{2,rbc}$, and $q_{3,rbc}$, which do not have an argument, used for coverage error in the main text and in Corollary 8 below, are defined in terms of those given above, which do have an argument. Specifically, the polynomials above should be doubled, divided by the standard Normal density, and evaluated at the Normal quantile $z_{\alpha/2}$, that is,

\[
q_{k,\bullet} = \frac{2}{\phi(z)}\, q_{k,\bullet}(z) \bigg|_{z = z_{\alpha/2}}, \qquad k = 1, 2, 3, \quad \bullet = us, rbc.
\]
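The conversion from a distribution-function polynomial to its coverage-error counterpart is mechanical, and a short Python sketch may help. The helper name `coverage_constant` is hypothetical, and the sketch assumes $z_{\alpha/2}$ denotes the upper quantile, i.e. $\Phi(z_{\alpha/2}) = 1 - \alpha/2$. The demonstration polynomial has the shape $-\phi(z)\sigma^{-2}z/2$ with an arbitrary, made-up value $\sigma^2 = 1$.

```python
from statistics import NormalDist

def coverage_constant(q_of_z, alpha=0.05):
    """Double q(z), divide by the standard Normal density, evaluate at z_{alpha/2}."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1.0 - alpha / 2.0)   # assumed upper alpha/2 quantile
    return 2.0 * q_of_z(z) / std_normal.pdf(z)

def example_polynomial(z, sigma2=1.0):
    """Illustrative polynomial -phi(z) * z / (2 sigma^2); sigma2 is a made-up value."""
    return -NormalDist().pdf(z) * z / (2.0 * sigma2)

constant = coverage_constant(example_polynomial, alpha=0.05)
```

For this example the density cancels and the constant reduces to $-z_{\alpha/2}/\sigma^2$.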


For traditional bias correction, $q_{1,us}(z)$, $q_{2,us}(z)$ and $q_{3,us}(z)$ are used, but such simplification can not be done for $p_{1,bc}(z)$ and $p_{3,bc}(z)$, which must be defined as

Pi,U*) = <KVK. 3 (^ [ft-'CfA'OVJ] {-(z 2 - l)/6} + E {-(* 2 - 3)/4})

+ <f>(z)ala £E [/r 1 C(.Y j ) 2 C(.Y j )4] {3(; 2 - l)/4}
and


P3,bc(-) = -<f>(z)a^. 

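Lastly, traditional bias correction also exhibits additional terms in the expansion (see discussion in the main text) representing the covariance of $\hat{m}$ and $\hat{B}_m$ (denoted by $\Omega_{1,bc}$) and the variance of $\hat{B}_m$ ($\Omega_{2,bc}$). We now state their precise forms. These arise from the mismatch between the variance of the numerator of $T_{bc}$ and the standardization used, $\hat{\sigma}^2_{us}$, but these are random, and so $\Omega_{1,bc}$ and $\Omega_{2,bc}$ must be derived from the nonrandom versions, $\tilde{\sigma}^2_{rbc}$ and $\tilde{\sigma}^2_{us}$ (cf. Section S.I.6; for the same reason $\eta_{us}$ and $\eta_{bc}$ must be nonrandom). Recalling the definitions above,

\[
\begin{aligned}
\frac{\tilde{\sigma}^2_{rbc}}{\tilde{\sigma}^2_{us}}
&= \tilde{\sigma}_{us}^{-2} E\big[h^{-1} \ell^0_{bc}(X)^2 v(X)\big] \\
&= \tilde{\sigma}_{us}^{-2} E\big[h^{-1} \{ \ell^0_{us}(X) + (\ell^0_{bc}(X) - \ell^0_{us}(X)) \}^2 v(X)\big] \\
&= 1 - 2 \tilde{\sigma}_{us}^{-2} E\big[h^{-1} \{ \ell^0_{us}(X) (\ell^0_{us}(X) - \ell^0_{bc}(X)) \} v(X)\big] + \tilde{\sigma}_{us}^{-2} E\big[h^{-1} \{ \ell^0_{us}(X) - \ell^0_{bc}(X) \}^2 v(X)\big] \\
&= 1 - 2 \rho^{1+(p+1)} \tilde{\sigma}_{us}^{-2} E\big[h^{-1} \{ \rho^{-p-2} \ell^0_{us}(X) (\ell^0_{us}(X) - \ell^0_{bc}(X)) \} v(X)\big] \\
&\quad + \rho^{1+2(p+1)} \tilde{\sigma}_{us}^{-2} E\big[b^{-1} \{ \rho^{-p-2} (\ell^0_{us}(X) - \ell^0_{bc}(X)) \}^2 v(X)\big].
\end{aligned}
\]

Therefore

\[
\Omega_{1,bc} = -2 \tilde{\sigma}_{us}^{-2} E\big[h^{-1} \{ \rho^{-p-2} \ell^0_{us}(X) (\ell^0_{us}(X) - \ell^0_{bc}(X)) \} v(X)\big]
\]

and

\[
\Omega_{2,bc} = \tilde{\sigma}_{us}^{-2} E\big[b^{-1} \{ \rho^{-p-2} (\ell^0_{us}(X) - \ell^0_{bc}(X)) \}^2 v(X)\big].
\]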

Remark 7 (Simplifications). It is possible for the above-defined polynomials to simplify in special cases. A leading example is in the homoskedastic Gaussian regression model:

\[ Y_i = m(X_i) + \varepsilon_i, \qquad \text{where } \varepsilon_i \sim N(0, v). \]

This model is a common theoretical baseline to study, though over-simplified from an empirical point of view. In this special case, $E[\varepsilon_i^3] = 0$ and thus $q_{3,us}(z) = 0$, entirely removing this term from the Edgeworth expansions. This has little bearing on the conceptual conclusions however, and in particular the comparison of undersmoothing and robust bias correction.

S.II.2 Details of Practical Implementation


In the main text we give a direct plug-in (DPI) rule to implement the coverage-error optimal bandwidth. Here we give complete details for this procedure as well as document a second practical choice, based on a rule-of-thumb (ROT) strategy. Both choices yield the optimal coverage error decay rate at interior and boundary points. All our methods are implemented in software available from the authors’ websites and via the R package nprobust available at https://cran.r-project.org/package=nprobust.

As in the density case, the MSE-optimal bandwidth undercovers when used in the undersmoothing confidence interval; that is, Remark 1 applies directly. See also Hall and Horowitz (2013).


S.II.2.1 Bandwidth Choice: Rule-of-Thumb (ROT) 

As with the density case, a simple rule-of-thumb based on rescaling the MSE-optimal bandwidth is:

\[
\hat{h}^{int}_{rot} = \hat{h}^{int}_{mse}\, n^{-(p-1)/((2p+3)(p+4))}
\qquad \text{and} \qquad
\hat{h}^{bnd}_{rot} = \hat{h}^{bnd}_{mse}\, n^{-p/((2p+3)(p+3))},
\]

where $\hat{h}^{int}_{mse}$ and $\hat{h}^{bnd}_{mse}$ denote readily-available implementations of the MSE-optimal bandwidth for interior and boundary points, respectively. See, e.g., Fan and Gijbels (1996). Again, when $p = 1$ in the interior, no scaling is needed ($\hat{h}^{int}_{rot} = \hat{h}^{int}_{mse}$), but for $p > 1$ any data-driven MSE-optimal bandwidth should always be shrunk to improve inference at the boundary (i.e., reduce coverage errors of the robust bias-corrected confidence intervals).
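The rescaling is a one-line computation. The sketch below uses a hypothetical helper name and takes the pilot MSE-optimal bandwidth as given (how that pilot is obtained is outside the scope of the example).

```python
def rot_bandwidth(h_mse, n, p=1, boundary=False):
    """Rescale an MSE-optimal bandwidth to the rule-of-thumb coverage-error rate."""
    if boundary:
        exponent = -p / ((2 * p + 3) * (p + 3))
    else:
        exponent = -(p - 1) / ((2 * p + 3) * (p + 4))
    return h_mse * n ** exponent

h_interior = rot_bandwidth(0.5, 1000, p=1)                 # exponent is zero
h_boundary = rot_bandwidth(0.5, 1000, p=1, boundary=True)  # exponent is -1/20
```

With $p = 1$ the interior bandwidth is returned unchanged, while the boundary bandwidth is shrunk by the factor $n^{-1/20}$.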

The ROT selector may be especially attractive for simplicity, if estimating the constants 
described below in the DPI case is prohibitive. 

Remark 2 applies to this case as well, though less transparently and without consequences 
that are as dramatic. 


S.II.2.2 Bandwidth Choice: Direct Plug-In (DPI) 

We now detail the required steps to implement the plug-in bandwidth $\hat{h}_{dpi}$ for interior and boundary points. We always set $K = L$, $\rho = 1$, and $q = p + 1$. The steps are:

(1) As a pilot bandwidth, use $\hat{h}_{mse}$: any data-driven version of $h^*_{mse}$.

(2) Using this bandwidth, estimate the regression function $m(X_i)$ as $\hat{m}(X_i; \hat{h}_{mse}) = r_p(X_i - x)'\hat{\beta}_p(\hat{h}_{mse})$, where $\hat{\beta}_p(\hat{h}_{mse})$ is the local polynomial coefficient estimate of order $p$ exactly as defined in the main text, using the bandwidth $\hat{h}_{mse}$.

Form $\hat{\varepsilon}_i = Y_i - \hat{m}(X_i; \hat{h}_{mse})$.

(3) Following Fan and Gijbels (1996, §4.2) we estimate derivatives using a global least squares polynomial fit of order $k + 2$. That is, estimate $\hat{m}^{(p+3)}(x)$ as

\[
\hat{m}^{(p+3)}(x) = [\hat{\gamma}]_{p+4} (p+3)! + [\hat{\gamma}]_{p+5} (p+4)!\, x + [\hat{\gamma}]_{p+6} \frac{(p+5)!}{2} x^2,
\]

where $[\hat{\gamma}]_k$ is the $k$-th element of the vector $\hat{\gamma}$ that is estimated as

\[
\hat{\gamma} = \arg\min_{\gamma \in \mathbb{R}^{p+6}} \sum_{i=1}^n \big( Y_i - r_{p+5}(X_i)'\gamma \big)^2.
\]

The estimate for $\hat{m}^{(p+2)}(x)$ is similar, with all indexes incremented down once.

For interior points, both are needed, while only $\hat{m}^{(p+2)}(x)$ is required for the boundary.

(4) The estimated polynomials $\hat{q}_{k,rbc}$, $k = 1, 2, 3$ and the bias constants $\hat{\eta}^{int}_{bc}$ and $\hat{\eta}^{bnd}_{bc}$ are defined as follows. The polynomials $q_{1,rbc}$, $q_{2,rbc}$, and $q_{3,rbc}$, which do not have an argument, are defined in terms of those given in Section S.II.1.2, which do have an argument. Specifically, the polynomials in Section S.II.1.2 should be doubled, divided by the standard Normal density, and evaluated at the Normal quantile $z_{\alpha/2}$, that is,

\[ q_{k,rbc} = \frac{2}{\phi(z_{\alpha/2})}\, q_{k,rbc}(z_{\alpha/2}). \]

Note that with the recommended choice of $K = L$, $\rho = 1$, and $q = p + 1$, the polynomials $q_{k,rbc}$, $k = 1, 2, 3$ can be read off the expressions for the undersmoothing versions, $q_{k,us}$, $k = 1, 2, 3$, with $p$ replaced by $p + 1$.

The bias terms, for the interior and boundary, are given as follows. With $q = p + 1$, and hence even, and $\rho = 1$, the expressions of Section S.II.4 simplify. For the interior: $\eta^{int}_{bc} = \sqrt{nh}\, h^{p+3}\, \tilde{\eta}^{int}_{bc}$, with

\[
\tilde{\eta}^{int}_{bc} = \frac{m^{(p+2)}}{(p+2)!}\, h^{-1} \Big\{ e_0' \tilde{\Gamma}_p^{-1} \big( \tilde{\Lambda}_{p,2} - \tilde{\Lambda}_{p,1} e_{p+1}' \tilde{\Gamma}_q^{-1} \tilde{\Lambda}_{q,1} \big) \Big\}
+ \frac{m^{(p+3)}}{(p+3)!} \Big\{ e_0' \tilde{\Gamma}_p^{-1} \big( \tilde{\Lambda}_{p,3} - \tilde{\Lambda}_{p,1} e_{p+1}' \tilde{\Gamma}_q^{-1} \tilde{\Lambda}_{q,2} \big) \Big\}.
\]

At the boundary: $\eta^{bnd}_{bc} = \sqrt{nh}\, h^{p+2}\, \tilde{\eta}^{bnd}_{bc}$, with

\[
\tilde{\eta}^{bnd}_{bc} = \frac{m^{(p+2)}}{(p+2)!} \Big\{ e_0' \tilde{\Gamma}_p^{-1} \big( \tilde{\Lambda}_{p,2} - \tilde{\Lambda}_{p,1} e_{p+1}' \tilde{\Gamma}_q^{-1} \tilde{\Lambda}_{q,1} \big) \Big\}.
\]

The estimates of these, $\hat{q}_{k,rbc}$, $k = 1, 2, 3$ and $\hat{\eta}^{int}_{bc}$ and $\hat{\eta}^{bnd}_{bc}$, are defined by replacing:

(i) $h$ with $\hat{h}_{mse}$,

(ii) population expectations with sample averages (see note below),

(iii) residuals $\varepsilon_i$ with $\hat{\varepsilon}_i$,

(iv) derivatives $m^{(p+2)}$ and $m^{(p+3)}$ with their estimators from above,

(v) limiting matrices $\tilde{\Gamma}_p$, $\tilde{\Lambda}_{p,2}$, etc, with the corresponding sample versions using the bandwidth $\hat{h}_{mse}$, e.g., $\tilde{\Gamma}_p$ is replaced with $\Gamma_p(\hat{h}_{mse}) = R_p' W_p(\hat{h}_{mse}) R_p / n$, where $W_p(\hat{h}_{mse}) = \mathrm{diag}\big(\hat{h}_{mse}^{-1} K((X_i - x)/\hat{h}_{mse})\big)$.

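(5) Finally $\hat{h}^{int}_{dpi} = \hat{H}^{int}_{dpi}(\hat{h}_{mse})\, n^{-1/(p+4)}$ and $\hat{h}^{bnd}_{dpi} = \hat{H}^{bnd}_{dpi}(\hat{h}_{mse})\, n^{-1/(p+3)}$, where

\[
\hat{H}^{int}_{dpi}(\hat{h}_{mse}) = \arg\min_{H} \Big| H^{-1} \hat{q}_{1,rbc} + H^{1+2(p+3)} (\hat{\eta}^{int}_{bc})^2 \hat{q}_{2,rbc} + H^{p+3} (\hat{\eta}^{int}_{bc}) \hat{q}_{3,rbc} \Big|,
\]

while at (or near) the boundary the optimal bandwidth is $h^*_{rbc} = H^*_{rbc}(\rho)\, n^{-1/(p+3)}$, where

\[
\hat{H}^{bnd}_{dpi}(\hat{h}_{mse}) = \arg\min_{H} \Big| H^{-1} \hat{q}_{1,rbc} + H^{1+2(p+2)} (\hat{\eta}^{bnd}_{bc})^2 \hat{q}_{2,rbc} + H^{p+2} (\hat{\eta}^{bnd}_{bc}) \hat{q}_{3,rbc} \Big|.
\]

These numerical minimizations are easily solved; see note below. Code available from the authors’ websites performs all the above steps.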
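Step (3) above, the global polynomial derivative estimate, can be sketched as follows. The helper name is hypothetical and the code is only an illustration of the formula: it fits a global least-squares polynomial of degree `deriv + 2` and differentiates it `deriv` times, so that `deriv = p + 3` uses the order $p + 5$ fit described in that step.

```python
import numpy as np
from math import factorial

def global_poly_derivative(x_obs, y_obs, x0, deriv):
    """Estimate the deriv-th derivative of m at x0 from a global polynomial fit."""
    degree = deriv + 2
    X = np.vander(np.asarray(x_obs, dtype=float), degree + 1, increasing=True)
    gamma, *_ = np.linalg.lstsq(X, np.asarray(y_obs, dtype=float), rcond=None)
    # d^deriv/dx^deriv of gamma_k x^k equals gamma_k * k!/(k - deriv)! * x^(k - deriv);
    # gamma[k] (0-indexed) corresponds to the (k+1)-th element of the coefficient vector
    return sum(
        gamma[k] * factorial(k) / factorial(k - deriv) * x0 ** (k - deriv)
        for k in range(deriv, degree + 1)
    )
```

For example, with $p = 1$ the interior rule needs `deriv = 4` (a degree-6 fit) and `deriv = 3` (a degree-5 fit).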

Remark 8 (Notes on computation).

• When numerically solving the above minimization problems, computation will be greatly sped up by squaring the objective function.

• For step 4(ii) above, in estimating $q_{1,rbc}$, and specifically when replacing population expectations with sample averages, we use the appropriate U-statistic forms to reduce bias. There are several terms which are expectations over two or three observations, and for these the second or third order U-statistic forms are preferred. For example, when estimating terms such as

E[k-X,(Xp(r r (X hti )t;\Kr p )(X hJ )) 2 Ei
we use


IT—n E E [k?A c (X>) 2 (r r (X L J'r;\Kr r )(X- K 'J)H) 

K ’ i= 1 m 


where $\ell^0_{bc}(X_i)$ is made feasible as in step 4(v).
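The minimization in step (5), with the squared objective suggested in Remark 8, can be sketched in plain Python. The function name, the search interval, and the grid-plus-golden-section strategy are assumptions made for illustration; the inputs `q1`, `q2`, `q3`, and `eta` stand for the estimated constants from step (4).

```python
import math

def dpi_constant(q1, q2, q3, eta, p=1, boundary=False,
                 lo=1e-2, hi=10.0, grid_size=2000, tol=1e-10):
    """Minimize (q1/H + H^(1+2k) eta^2 q2 + H^k eta q3)^2 over H in [lo, hi]."""
    k = p + 2 if boundary else p + 3

    def objective(H):
        value = q1 / H + H ** (1 + 2 * k) * eta ** 2 * q2 + H ** k * eta * q3
        return value * value            # squared instead of absolute value

    # coarse log-spaced grid to locate a bracket around the global minimum
    grid = [lo * (hi / lo) ** (i / (grid_size - 1)) for i in range(grid_size)]
    j = min(range(grid_size), key=lambda i: objective(grid[i]))
    a, b = grid[max(j - 1, 0)], grid[min(j + 1, grid_size - 1)]

    # golden-section refinement inside the bracket
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = objective(c), objective(d)
    while b - a > tol:
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = objective(c)
        else:
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = objective(d)
    return 0.5 * (a + b)
```

The resulting constant is then multiplied by $n^{-1/(p+4)}$ in the interior or $n^{-1/(p+3)}$ at the boundary. The coarse grid guards against the squared objective having several local minima, for instance when the signed objective crosses zero.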


S.II.2.3 Alternative Standard Errors 

As argued in the main text, using variance forms other than (20) and (21) can be detrimental to coverage. Within these forms however, two alternative estimates of $\Sigma$ are natural. First, motivated by the fact that the least-squares residuals are on average too small, the well-known HC$k$ class of heteroskedasticity consistent estimators can be used; see MacKinnon (2013) for details and a recent review. In our notation, these are defined as follows. First, $\hat{\sigma}^2_{us}$-HC0 is the estimator above. Then, for $k = 1, 2, 3$, the $\hat{\sigma}^2_{us}$-HC$k$ estimator is obtained by dividing $\hat{\varepsilon}_i^2$ by, respectively, $(n - 2\,\mathrm{tr}(Q_p) + \mathrm{tr}(Q_p' Q_p))/n$, $(1 - Q_{p,ii})$, and $(1 - Q_{p,ii})^2$, where $Q_{p,ii}$ is the $i$-th diagonal element of the projection matrix $Q_p := R_p \Gamma_p^{-1} R_p' W_p / n$. The corresponding estimators $\hat{\sigma}^2_{rbc}$-HC$k$ are defined the same way, with $q$ in place of $p$. As is well-known in the literature, these estimators perform better for small sample sizes, a fact we confirm in our simulation study below.
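The HC$k$ adjustments amount to rescaling the squared residuals before they enter $\hat{\Sigma}_p$. A minimal NumPy sketch follows; the function name and argument layout are hypothetical, with `R` standing for $R_p$, `w` for the diagonal of $W_p$, and `resid` for the residuals $\hat{\varepsilon}_i$.

```python
import numpy as np

def hc_squared_residuals(R, w, resid, k):
    """Return HCk-adjusted squared residuals for k in {0, 1, 2, 3}."""
    R = np.asarray(R, dtype=float)
    w = np.asarray(w, dtype=float)
    e2 = np.asarray(resid, dtype=float) ** 2
    n = R.shape[0]
    RW = R.T * w                                   # R_p' W_p
    gamma = RW @ R / n                             # Gamma_p
    Q = R @ np.linalg.solve(gamma, RW) / n         # Q_p = R_p Gamma_p^{-1} R_p' W_p / n
    q_ii = np.diag(Q)
    if k == 0:
        return e2
    if k == 1:
        return e2 / ((n - 2.0 * np.trace(Q) + np.trace(Q.T @ Q)) / n)
    if k == 2:
        return e2 / (1.0 - q_ii)
    if k == 3:
        return e2 / (1.0 - q_ii) ** 2
    raise ValueError("k must be 0, 1, 2, or 3")
```

With uniform weights, $Q_p$ is the usual least-squares projection matrix, so the HC1 divisor reduces to the familiar degrees-of-freedom correction $(n - p - 1)/n$.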

A second option is to use nearest-neighbor-based variance estimators with a fixed number of neighbors, following the ideas of Müller and Stadtmüller (1987); Abadie and Imbens (2008). To define these, let $J$ be a fixed number and $j(i)$ be the $j$-th closest observation to $X_i$, $j = 1, \ldots, J$, and set $\hat{v}(X_i) = \frac{J}{J+1} \big( Y_i - \sum_{j=1}^J Y_{j(i)} / J \big)^2$. This “estimate” is unbiased (but inconsistent) for $v(X_i)$.
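A plain-Python sketch of the nearest-neighbor variance “estimate” is given below. The function name is hypothetical, ties in distance are broken by observation order (an arbitrary choice not specified in the text), and the sample must contain more than $J$ observations.

```python
def nn_variance(x, y, J=3):
    """Return v_hat(X_i) = J/(J+1) * (Y_i - mean of Y over the J nearest neighbors)^2."""
    n = len(x)
    estimates = []
    for i in range(n):
        # indices of the J closest observations to X_i, excluding i itself
        neighbors = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: abs(x[j] - x[i]),
        )[:J]
        neighbor_mean = sum(y[j] for j in neighbors) / J
        estimates.append(J / (J + 1) * (y[i] - neighbor_mean) ** 2)
    return estimates
```

The resulting values can be placed on the diagonal of $\hat{\Sigma}$ in place of the squared residuals.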

Both types of residual estimators could be handled in our results. The constants will change, but the rates will not. This is because, in all cases, the errors in estimating $v(X_i)$ are no greater than in the original $\hat{m}(x)$. Inspection of the proof shows that simple modifications allow for the HC$k$ estimators: only the terms of Eqn. (27) will change, and indeed, we conjecture that the HC$k$ estimators will result in fewer terms and a reduced coverage error. This is consistent with the improved finite-sample behavior of these estimators and the fact that they are asymptotically equivalent. Accommodating the nearest-neighbor estimates requires slightly more work and a modified version of Assumption S.II.3.3.

One crucial property of our method, in the context of Edgeworth expansions, is that the bias in estimation of $\Sigma$ is of the same order as the original $\hat{m}(x)$. Using other methods may result in additional terms, with possibly distinct rates, appearing in the Edgeworth expansions. Some examples that may have this issue are (i) using $\hat{v}(X_i) = (Y_i - \hat{m}(x))^2$; (ii) using local or assuming global heteroskedasticity; (iii) using other nonparametric estimators for $v(X_i)$, relying on new tuning parameters.


S.II.3 Assumptions 

Copied directly from the main text (see discussion there), the following assumptions are 
sufficient for our results. 

Assumption S.II.3.1 (Data-generating process). $\{(Y_1, X_1), \ldots, (Y_n, X_n)\}$ is a random sample, where $X_i$ has the absolutely continuous distribution with Lebesgue density $f$, $E[Y^{8+\delta} | X] < \infty$ for some $\delta > 0$, and in a neighborhood of $x$, $f$ and $v$ are continuous and bounded away from zero, $m$ is $S > q + 2$ times continuously differentiable with bounded derivatives, and $m^{(S)}$ is Hölder continuous with exponent $\varsigma$.

Assumption S.II.3.2 (Kernels). The kernels K and L are positive, bounded, even functions, 
and with compact support. 

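Assumption S.II.3.3 (Cramér’s Condition). For each $\delta > 0$ and all sufficiently small $h$, the random variables $Z_{us}(u)$ and $Z_{rbc}(u)$ defined below obey

\[
\sup_{t \in \mathbb{R}^{\dim\{Z(u)\}},\ \|t\| > \delta} \left| \int \exp\{ i\, t'Z(u) \}\, f(x - uh)\, du \right| \leq 1 - C(x, \delta) h,
\]

where $C(x, \delta) > 0$ is a fixed constant, $\|t\|^2 = \sum_{d=1}^{\dim\{Z(u)\}} t_d^2$, and $i = \sqrt{-1}$.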


The random variables of Assumption S.II.3.3 are defined as follows. For two kernels $K_1$ and $K_2$, two polynomial orders (i.e. positive integers) $p_1$ and $p_2$, a bandwidth $b$, and a scalar $\rho$,


let 



and 



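$\mathrm{vech}\big(K_1(u) K_2(u\rho)\, r_{p_1}(u)\, r_{p_2}(u\rho)'\, v(x - ub)\big)'$,

$\mathrm{vech}\big(K_1(u) K_2(u\rho)\, r_{p_1}(u)\, r_{p_2}(u\rho)'\, \varepsilon\, (m(x - ub) - r_{p_2}(ub)'\beta_{p_2})\big)'$,

$\mathrm{vech}\big(K_2(u)^2\, r_{p_2}(u)\, r_{p_2}(u)'\, r_{p_2}(u)'\big)'$,

$\mathrm{vech}\big(K_1(u) K_2(u\rho)\, r_{p_1}(u)\, r_{p_2}(u\rho)'\, r_{p_2}(u)'\, \varepsilon\big)'$,

$\mathrm{vech}\big(K_1(u) K_2(u\rho)\, r_{p_1}(u)\, r_{p_2}(u\rho)'\, r_{p_2}(u\rho)'\, \varepsilon\, (m(x - ub) - r_{p_2}(ub)'\beta_{p_2})\big)'\Big)$.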


The subscripts are intended to make clear that $Z_m(\cdot)$ collects quantities from the numerator of the Studentized statistic, while $Z_\sigma(\cdot)$ gathers additional variables required for the variance estimation. With this notation, we define

\[
\begin{aligned}
Z_{us}(u) &= \big( Z_m(u; K, p, p, h, 1)',\ Z_\sigma(u; K, K, p, p, h, 1)' \big)', \\
Z_{bc}(u) &= \big( Z_m(u; K, p, p+1, h, 1)',\ Z_m(u; L, q, q, b, \rho)',\ \mathrm{vech}(K(u) r_p(u) u^{p+1})',\ Z_\sigma(u; K, K, p, p, h, 1)' \big)',
\end{aligned}
\]

and

\[
\begin{aligned}
Z_{rbc}(u) = \big( & Z_m(u; K, p, p+1, h, 1)',\ Z_m(u; L, q, q, b, \rho)',\ \mathrm{vech}(K(u) r_p(u) u^{p+1})', \\
& Z_\sigma(u; K, K, p, q, b, \rho)',\ Z_\sigma(u; L, L, q, q, b, 1)',\ Z_\sigma(u; K, L, p, q, b, \rho)' \big)'.
\end{aligned}
\]

This notation is quite compact, and while it emphasizes the simplicity of Cramér’s condition and the fact that it puts mild restrictions on the kernels, it does obscure the full notational breadth, particularly for $Z_{rbc}$. This is mostly repetitive: what holds for the kernel $K$ and order $p$ fit must also hold for $L$ and $q$, and for their squares and cross products. To make this clear, we can expand all the $Z_m$ and $Z_\sigma$, to write out the full statistics


Z u s (w) 


(k( u)r p (u)'£, K(u)r p (u)'(m(x — uh ) 


r p (uh)'/3 p ), vech (K(u)r p (u)r p (u)')', 


vech (K(u) 2 r p (u)r p (u)'e 2 )', vech (K(u) 2 r p (u)r p (u)'v(x — uh))', 

vech {K{u) 2 r p {u)r p {u)'e(m(x — uh) — r p (uh)'f3 p ))', vech (K(u) 2 r p (u)r p (u)'r p (u)')\ 


vech (K(u) 2 r p (u)r p (u)'r p (u)'e)', vech (K(u) 2 r p (u)r p (u)'r p (u)'£(m(x 



Z b c (u) 


(' u)r p (u )'£, vech {K(u)r p (u)r p (u)')', 


vech (K(u) 2 r p (u)r p (u)'e 2 ) 1 , vech (K(u) 2 r p (u)r p (u)'v(x — uh ))', 
vech (K(u) 2 r p (u)r p (u)'£(m(x — uh) — r p {uh)' (3 P ))', vech (A' (u) 2 r p (u)r p (u)'r p (u)')', 
vech(K(u) 2 rp(u)r p (u)'r p (u)' e)', vech.(K(u) 2 r p (u)r p (u)'r p (u)'£(m(x — uh) — r p (uh)'f3 p ))', 
K(u)r p (u)'(m(x - uh) - r p+1 (uh)'/3 p+l ), L(up)r q (up)'e, vech (L{up)r q (up)r q (up)')', 


vech(K(u)r p (u)u p+1 )', L(up)r q (up)'(m(x — uh) 


(u
and


Z Ihc (u) = (z hc (u)', vech (K(u) 2 r p (u)r p {u)'E 2 )'i vech(AT (u) 2 r p (u)r p (u)'v(x - ub))', 

vech(K(u) 2 r p (u)r p (u)'e{m{x — ub) — r q (ub)' j3 q ))', vech (K(u) 2 r p (u)r p {u)'r q {up)')\ 
vech (K(u) 2 r p (u)r p (u)'r q (up)'e)', vech [K {u) 2 r p {u)r p {u)' r q {up)' £{m(x — ub) — r q (ub)' /3 q )y 
vech {L{u) 2 r q {u)r q {u)'e 2 )', vech {L{u) 2 r q {u)r q {u)'v(x — ub ))', 
vech(L(-u) 2 r, ? (-u)rq(M) / £(m(a; — ub) — r q {ub)' (3 q ))' , vech(L(u) 2 r q (u)r q (u)'r q (u)')', 
vech (L(u) 2 r q (u)r q (u)'r q (u)'£)', vech (L(u) 2 r q (u)r q (u)'r q (u)'E(m(x — ub) — r q {ub)'(3 q ))', 
vech (K(u)L(up)r p (u)r q (up)'£ 2 )', vech (K(u)L(up)r p (u)r q (up)'v(x — ub))', 
vech (K(u)L(up)r p (u)r q (up)'e{m(x — ub) — r q {ub)'/3 q ))', vech (L(u) 2 r q (u)r q (uyr q (uy)', 
vech (K(u)L(up)r p (u)r q (up)'r q (uy e)', 

vech (K(u)L{up)r p (u)r q (up)'r q (upy£{rn(x — ub) —r q (ub)'/3 q )y j . 

Remark 9 (Sufficient Conditions for Cramér’s Condition). Assumption S.II.3.3 is a high level condition, but one that is fairly mild. It is essentially a continuity requirement, and is discussed at length by (among others) Bhattacharya and Rao (1976), Bhattacharya and Ghosh (1978), and Hall (1992a). For a recent work in econometrics, the present condition can be compared to that employed by Kline and Santos (2012) for parametric regression (the role of the covariates is here played by $r_p(X_{h,i})$): ours is more complex due to the nonparametric smoothing bias and the fact that the expansion is carried out to higher order.

It is straightforward to provide sufficient conditions for Assumption S.II.3.3, given that Assumptions S.II.3.1 and S.II.3.2 hold. In particular, if we additionally assume that

\[
\big( 1,\ \mathrm{vech}(K(u) r_p(u) r_p(u)')',\ \mathrm{vech}(K(u)^2 r_p(u) r_p(u)' r_p(u)')' \big)
\]

comprises a linearly independent set of functions on $[-1, 1]$, then $Z_{us}(u)$ has components that are nondegenerate and absolutely continuous, and this will imply that Assumption S.II.3.3 holds for $Z_{us}(u)$, by arguing as in Bhattacharya and Ghosh (1978, Lemma 2.2) and Hall (1992a, p. 65). This is precisely the approach taken by Chen and Qin (2002), when studying undersmoothed local linear regression. If the linear independence continues to hold when the set of functions is augmented with $\mathrm{vech}(L(u) r_q(u) r_q(u)')$, then $Z_{bc}(u)$ satisfies Assumption S.II.3.3 as well. To obtain the result for $Z_{rbc}(u)$ requires that linear independence hold for

\[
\begin{aligned}
\big( & 1,\ \mathrm{vech}(K(u) r_p(u) r_p(u)')',\ \mathrm{vech}(K(u)^2 r_p(u) r_p(u)' r_q(u)')',\ \mathrm{vech}(L(u) r_q(u) r_q(u)')', \\
& \mathrm{vech}(L(u)^2 r_q(u) r_q(u)' r_q(u)')',\ \mathrm{vech}(K(u) L(u\rho) r_p(u) r_q(u\rho)' r_q(u\rho)')' \big).
\end{aligned}
\]

At heart, these are requirements on the kernel functions, just as in Assumption S.I.3.3 in the density case. The uniform kernel is again ruled out. See Remark 5. Further, note that if these sets of functions are not linearly independent, there will exist a smaller set of functions which are linearly independent and can replace the original set while leaving the value of the statistic unchanged (see Bhattacharya and Ghosh (1978, p. 442)).

In sum, this makes clear that Assumption S.II.3.3 is quite mild. ■

Finally, the precise random variables $Z_{us}(u)$, $Z_{bc}(u)$, and $Z_{rbc}(u)$ used can be replaced with slightly different constructions without altering the conclusions of Theorem 5: there are other potential functions $T$ that satisfy Eqn. (23) in the proof. Such changes necessarily involve asymptotically negligible terms, and do not materially alter the severity of the restrictions imposed.

S.II.4 Bias 

We will not present a detailed discussion of bias issues, along the lines of Section S.I.4.1, for brevity; we focus only on the case of nonbinding smoothness.

The biases $\eta_{us}$ and $\eta_{bc}$ are not as conceptually simple as in the density case. The closest parallel to the density case would be (for example) $\eta_{us} = \sqrt{nh}(E[\hat{m}] - m)$, but this can not be used due to the presence of $\Gamma_p^{-1}$ inside the expectation, and the next natural choice, the conditional bias $\sqrt{nh}(E[\hat{m} | X_1, \ldots, X_n] - m)$, is still random. Instead, $\eta_{us}$ and $\eta_{bc}$ are biases computed after replacing $\Gamma_p$, $\Gamma_q$, and $\Lambda_{p,1}$ with their expectations, denoted $\tilde{\Gamma}_p$, $\tilde{\Gamma}_q$, and $\tilde{\Lambda}_{p,1}$. We thus define

\[
\begin{aligned}
\eta_{us} &= \sqrt{nh} \int e_0' \tilde{\Gamma}_p^{-1} K(u) r_p(u) \big( m(x - uh) - r_p(uh)'\beta_p \big) f(x - uh)\, du, \\
\eta_{bc} &= \sqrt{nh} \int e_0' \tilde{\Gamma}_p^{-1} K(u) r_p(u) \big( m(x - uh) - r_{p+1}(uh)'\beta_{p+1} \big) f(x - uh)\, du \\
&\quad - \sqrt{nh}\, \rho^{p+1} \int e_0' \tilde{\Gamma}_p^{-1} \tilde{\Lambda}_{p,1} e_{p+1}' \tilde{\Gamma}_q^{-1} L(u) r_q(u) \big( m(x - ub) - r_q(ub)'\beta_q \big) f(x - ub)\, du.
\end{aligned}
\tag{22}
\]

For the generic results of coverage error or the generic Edgeworth expansions of Theorem
5 below, the above definitions of $\eta_{us}$ and $\eta_{bc}$ are suitable. For the Corollaries detailing specific
cases, and to understand the behavior at different points, it is useful to make the leading terms
precise, that is, analogues of Equations (10) and (11). We must consider interior and boundary
point estimation, and even and odd $q$. We depart slightly from other terms of the expansion
in that we do retain only the leading term for some pieces. This is done in order to capture
the rate of convergence explicitly and to give practicable results. These results are derived
by Fan and Gijbels (1996, Section 3.7) and similar calculations (though our expressions differ
slightly as fixed-$n$ expectations are retained as much as possible).

Since $p$ is odd, both at boundary and interior points we have

\[
\eta_{us} = \sqrt{nh}\,h^{p+1}\,\frac{m^{(p+1)}}{(p+1)!}\, e_0'\tilde\Gamma_p^{-1}\tilde\Lambda_{p,1}\,[1 + o(1)].
\]


Moving to $\eta_{bc}$, consider the first term, which in the present notation is
$\sqrt{nh}\,E[h^{-1} e_0'\tilde\Gamma_p^{-1}(K r_p)(X_{h,i})(m(X_i) - r_{p+1}(X_i - x)'\beta_{p+1})]$. With $p + 1$ even, we find that in the interior the leading terms are


Vnhh p+3 e' 0 f- 1 


/ m (p+ 2) m (p+3) \ 

( V2 + FW H 


[1 + o(l)], 


due to the well-known symmetry properties of local polynomials that result in the cancellation
of the leading terms of $\tilde\Gamma_p^{-1}$ and $\tilde\Lambda_{p,2}$. The rate of $h^{p+3}$ accounts for this. At the boundary, no
such cancellation occurs and we have only


. 77oV/'T*/ 

^ hP+ \p + 2)\ e '° TplAp ’ 2 f 1 + ’ 


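Next, turn to the bias of the bias estimate:

\[
\sqrt{nh}\,\rho^{p+1} \int e_0'\tilde\Gamma_p^{-1}\tilde\Lambda_{p,1} e_{p+1}'\tilde\Gamma_q^{-1} L(u) r_q(u) \left( m(x - ub) - r_q(ub)'\beta_q \right) f(x - ub)\,du.
\]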


If $q$ is odd (so that $q - (p + 1)$ is also odd), then at the interior or boundary the leading term
will be

\[
\sqrt{nh}\,b^{q+1}\rho^{p+1}\,\frac{m^{(q+1)}}{(q+1)!}\, e_0'\tilde\Gamma_p^{-1}\tilde\Lambda_{p,1} e_{p+1}'\tilde\Gamma_q^{-1}\tilde\Lambda_{q,1}\,[1 + o(1)] \asymp \sqrt{nh}\,h^{p+1} b^{q-p}.
\]

The same expression applies at the boundary for q even. However, for the interior, if q is 
even, then we again have cancellation of certain leading terms, resulting in the bias of the bias 
estimate being 

V^b q+2 p p+1 e\'^Ap^e'p^T- 1 ^ g + 1 ^, A g ,i + + 2 ), t 1 + °(!)] x Vnhh p+1 b q+1 ~ p . 

Combining all these results, we find the following (dropping remainder terms): for an interior
point, with $q$ even,



with q odd, 


Vnhh p+3 |e[ ) fp 1 ^ 



and finally at a boundary point, for any q, 



S.II.5 Main Result: Edgeworth Expansion 

We now state our generic Edgeworth expansion, from whence the coverage probability expansion
results follow immediately. We have opted to state separate results for undersmoothing,
bias correction, and robust bias correction, rather than the unified statement of Theorem 3,
for clarity. The unified structure is still present, and will be used in the proof of the result
below, but is too cumbersome to use here. The Standard Normal distribution and density
functions are $\Phi(z)$ and $\phi(z)$, respectively.

Theorem 5. Let Assumptions S.II.3.1, S.II.3.2, and S.II.3.3 hold, and assume $nh/\log(n) \to \infty$.

(a) If $\eta_{us} \log(nh) \to 0$, then for



we have 


\[ \sup_{z \in \mathbb{R}} \left| P[T_{us} < z] - F_{us}(z) \right| = o\left( (nh)^{-1} + (nh)^{-1/2}\eta_{us} + \eta_{us}^2 \right). \]


(b) If $\eta_{bc} \log(nh) \to 0$ and $\rho \to 0$, then for
we have


P p+2 (n 1 + pp +1 n 2 )^-z, 


\[ \sup_{z \in \mathbb{R}} \left| P[T_{bc} < z] - F_{bc}(z) \right| = o\left( (nh)^{-1} + (nh)^{-1/2}\eta_{bc} + \eta_{bc}^2 + \rho^{1+2(p+1)} \right). \]


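(c) If $\eta_{bc} \log(nh) \to 0$ and $\rho \to \bar\rho < \infty$, then for

\[
F_{rbc}(z) = \Phi(z) + \frac{1}{\sqrt{nh}}\, p_{1,rbc}(z) + \eta_{bc}\, p_{3,rbc}(z) + \frac{1}{nh}\, q_{1,rbc}(z) + \eta_{bc}^2\, q_{2,rbc}(z) + \frac{\eta_{bc}}{\sqrt{nh}}\, q_{3,rbc}(z),
\]

we have

\[ \sup_{z \in \mathbb{R}} \left| P[T_{rbc} < z] - F_{rbc}(z) \right| = o\left( (nh)^{-1} + (nh)^{-1/2}\eta_{bc} + \eta_{bc}^2 \right). \]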


S.II.5.1 Coverage Error for Undersmoothing 

For undersmoothing estimators, we have the following result, which is valid for both interior 
and boundary points, with moments appropriately truncated if necessary. This result is the 
analogue of the robust bias correction corollary in the main text, and follows directly from the 
generic theorem there or Theorem 5 above. Exponents such as 1 + 2 (p + 1) are intentionally 
not simplified to ease comparison to other results, particularly the density case. 

The polynomials $q_{1,us}$, $q_{2,us}$, and $q_{3,us}$, which do not have an argument, are defined in
terms of those given in Section S.II.1.2 and used in Theorem 5, which do have an argument.
Specifically, the polynomials in Section S.II.1.2 and Theorem 5 should be doubled, divided by
the standard Normal density, and evaluated at the Normal quantile $z_{\alpha/2}$, that is,

\[
q_{k,us} := \left. \frac{2\, q_{k,us}(z)}{\phi(z)} \right|_{z = z_{\alpha/2}}, \qquad k = 1, 2, 3.
\]

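Corollary 8 (Undersmoothing). Let the conditions of Theorem 5(a) hold. Then

\[
\begin{aligned}
P[m \in I_{us}] = 1 - \alpha + \Big\{ & \frac{1}{nh}\, q_{1,us} + nh^{1+2(p+1)} (m^{(p+1)})^2 \left( e_0'\tilde\Gamma_p^{-1}\tilde\Lambda_{p,1}/(p+1)! \right)^2 q_{2,us} \\
& + h^{p+1} (m^{(p+1)}) \left( e_0'\tilde\Gamma_p^{-1}\tilde\Lambda_{p,1}/(p+1)! \right) q_{3,us} \Big\}\, \phi(z_{\alpha/2})\, \{1 + o(1)\}.
\end{aligned}
\]

In particular, if $h^*_{us} = H^*_{us}\, n^{-1/(1+(p+1))}$, then $P[m \in I_{us}] = 1 - \alpha + O(n^{-(p+1)/(1+(p+1))})$, where

\[
\begin{aligned}
H^*_{us} = \arg\min_{H} \Big| & H^{-1} q_{1,us} + H^{1+2(p+1)} (m^{(p+1)})^2 \left( e_0'\tilde\Gamma_p^{-1}\tilde\Lambda_{p,1}/(p+1)! \right)^2 q_{2,us} \\
& + H^{p+1} (m^{(p+1)}) \left( e_0'\tilde\Gamma_p^{-1}\tilde\Lambda_{p,1}/(p+1)! \right) q_{3,us} \Big|.
\end{aligned}
\]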
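
The minimization defining $H^*_{us}$ is a one-dimensional problem and can be sketched numerically. The following is a minimal illustration only, assuming the constants are known: the values passed for `q1`, `q2`, `q3` and for `c` (standing in for $m^{(p+1)} e_0'\tilde\Gamma_p^{-1}\tilde\Lambda_{p,1}/(p+1)!$) are hypothetical placeholders rather than quantities computed from data, and the grid search and function names are our own choices, not the paper's bandwidth selector.

```python
def coverage_error_constant(H, p, c, q1, q2, q3):
    """|H^{-1} q1 + H^{1+2(p+1)} c^2 q2 + H^{p+1} c q3|, the objective in Corollary 8.

    c is a hypothetical stand-in for m^{(p+1)} * e_0' Gamma^{-1} Lambda / (p+1)!.
    """
    return abs(q1 / H + H ** (1 + 2 * (p + 1)) * c ** 2 * q2 + H ** (p + 1) * c * q3)


def argmin_H(p, c, q1, q2, q3, lo=0.05, hi=5.0, n_grid=20000):
    """Grid search for the minimizing constant H on [lo, hi] (an illustrative choice)."""
    best_H, best_val = lo, float("inf")
    for j in range(n_grid + 1):
        H = lo + (hi - lo) * j / n_grid
        val = coverage_error_constant(H, p, c, q1, q2, q3)
        if val < best_val:
            best_H, best_val = H, val
    return best_H, best_val


def bandwidth(n, H_star, p):
    """h = H * n^{-1/(1+(p+1))}, the rate that balances the three terms."""
    return H_star * n ** (-1.0 / (1 + (p + 1)))


# Hypothetical constants, chosen only so the minimizer is easy to check by hand:
# with q3 = 0 and c = q1 = q2 = 1, p = 1, the objective is 1/H + H^5.
H_star, value = argmin_H(p=1, c=1.0, q1=1.0, q2=1.0, q3=0.0)
print(H_star, value, bandwidth(10_000, H_star, p=1))
```

With these placeholder constants the objective is $1/H + H^5$, whose minimizer is $5^{-1/6}$, which gives a simple sanity check on the search.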


S.II.6 Proof of Main Result 

We will first prove Theorem 5(a), as it is notationally simplest. From a technical and
conceptual point of view, proving the remainder of Theorem 5 is identical, simply more involved
notationally due to the additional complexity of the bias correction. Outlines of these proofs
are found below.

S.II.6.1 Proof of Theorem 5(a) 

Let $s = \sqrt{nh}$.

Throughout this proof, we will generally omit the subscripts $us$ and $p$ when this causes
no confusion. This entire proof focuses on the undersmoothing statistic, $T_{us} = \hat\sigma_{us}^{-1} s(\hat m - m)$,
and since bias correction is not involved at all, the associated constructions such as $\Gamma_q$, $W_q$,
etc., do not appear, and hence there is no need to carry the additional notation to distinguish
$W_p$ from $W_q$, or $\hat\sigma_{us}$ from $\hat\sigma_{rbc}$, for example, and we will simply write $\Gamma$ for $\Gamma_p$, $W$ for $W_p$, $\hat\sigma$
for $\hat\sigma_{us}$, etc.

Our goal is to expand $P[T_{us} < z]$, where $T_{us} = \hat\sigma^{-1} s(\hat m - m)$. The proof proceeds by
identifying a smooth function $\tilde T := \tilde T(\bar Z_{us})$ such that, for the random variable $Z_{us} := Z_{us}(u)$ that
obeys Cramér's condition (Assumption S.II.3.3), $\tilde T(E[Z_{us}]) = 0$ and

\[ P[T_{us} < z] = P[\tilde T(\bar Z_{us}) < \tilde z] + o(s^{-2} + s^{-1}\eta + \eta^2), \qquad (23) \]

where $\bar Z = \sum_{i=1}^n Z_i/n$ and $\tilde z$ is a known, nonrandom quantity that depends on the original
quantile $z$ and the remainder $T_{us} - \tilde T$ (see Remark 9). An Edgeworth expansion for $\tilde T$ holds
under Assumption S.II.3.3, and a Taylor expansion of this function around $z$ yields the final
result. As in the density case, $\tilde z$ will capture the bias terms of $T_{us}$: in that case $\tilde z = z - \eta/\tilde\sigma$,
but here bias is present in both the numerator and the Studentization.

To begin, define the notation $R = [r_p(X_1 - x), \cdots, r_p(X_n - x)]'$ and $M = [m(X_1), \ldots, m(X_n)]'$,
and use this to split $T_{us}$ into variance and bias terms, as follows:

\[ T_{us} = \hat\sigma^{-1} s e_0'\Gamma^{-1}R'W(Y - M)/n + \hat\sigma^{-1} s e_0'\Gamma^{-1}R'W(M - R\beta)/n. \]

We use this decomposition to rewrite $P[T_{us} < z]$ as


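\[
\begin{aligned}
P[T_{us} < z] &= P\left[ T_{us} - \tilde\sigma^{-1}\eta < z - \tilde\sigma^{-1}\eta \right] \\
&= P\left[ \left\{ \hat\sigma^{-1} s e_0'\Gamma^{-1}R'W(Y - M)/n + \hat\sigma^{-1} s e_0'\Gamma^{-1}R'W(M - R\beta)/n - \tilde\sigma^{-1}\eta \right\} < z - \tilde\sigma^{-1}\eta \right] \\
&= P\Big[ \tilde\sigma^{-1} s e_0'\Gamma^{-1}R'W(Y - M)/n \\
&\qquad + \tilde\sigma^{-1} s e_0'\tilde\Gamma^{-1}R'W(M - R\beta)/n - \tilde\sigma^{-1}\eta \\
&\qquad + \tilde\sigma^{-1} s e_0'\left( \Gamma^{-1} - \tilde\Gamma^{-1} \right) R'W(M - R\beta)/n \\
&\qquad + \left( \hat\sigma^{-1} - \tilde\sigma^{-1} \right) s e_0'\Gamma^{-1}R'W(Y - M)/n \\
&\qquad + \left( \hat\sigma^{-1} - \tilde\sigma^{-1} \right) s e_0'\Gamma^{-1}R'W(M - R\beta)/n < z - \tilde\sigma^{-1}\eta \Big].
\end{aligned}
\qquad (24)
\]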


The first three lines in the last equality obey the desired properties of $\tilde T$ by the orthogonality
of $\varepsilon_i$, the definition of $\eta_{us}$ in Eqn. (22) as $E[s e_0'\tilde\Gamma^{-1}R'W(M - R\beta)/n]$, and the fact that
$\Gamma^{-1} - \tilde\Gamma^{-1} = \tilde\Gamma^{-1}(\tilde\Gamma - \Gamma)\Gamma^{-1}$. For the final two (which are $T_{us} - \tilde\sigma^{-1}s(\hat m - m) = (\hat\sigma^{-1} - \tilde\sigma^{-1})s(\hat m - m)$),
we must expand the difference $\hat\sigma^{-1} - \tilde\sigma^{-1}$. Accounting for the resulting terms
will constitute the bulk of the remainder of the proof, as well as complete the construction of
$\tilde z$ and the remainder terms of Eqn. (23).²

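To begin, with $\tilde\sigma^2 = e_0'\tilde\Gamma^{-1}\tilde\Psi\tilde\Gamma^{-1}e_0$ defined in Section S.II.1.2,

\[
\frac{1}{\hat\sigma} = \frac{1}{\tilde\sigma}\left( \frac{\hat\sigma^2}{\tilde\sigma^2} \right)^{-1/2} = \frac{1}{\tilde\sigma}\left( 1 + \frac{\hat\sigma^2 - \tilde\sigma^2}{\tilde\sigma^2} \right)^{-1/2},
\]

and hence a Taylor expansion gives

\[
\frac{1}{\hat\sigma} = \frac{1}{\tilde\sigma}\left[ 1 - \frac{1}{2}\,\frac{\hat\sigma^2 - \tilde\sigma^2}{\tilde\sigma^2} + \frac{1}{2!}\,\frac{3}{4}\left( \frac{\hat\sigma^2 - \tilde\sigma^2}{\tilde\sigma^2} \right)^2 - \frac{1}{3!}\,\frac{15}{8}\left( \frac{\hat\sigma^2 - \tilde\sigma^2}{\tilde\sigma^2} \right)^3 \frac{\tilde\sigma^7}{\bar\sigma^7} \right],
\]

for a point $\bar\sigma^2 \in [\tilde\sigma^2, \hat\sigma^2]$, and so

\[
\hat\sigma^{-1} - \tilde\sigma^{-1} = -\frac{1}{2}\,\frac{\hat\sigma^2 - \tilde\sigma^2}{\tilde\sigma^3} + \frac{3}{8}\,\frac{(\hat\sigma^2 - \tilde\sigma^2)^2}{\tilde\sigma^5} - \frac{5}{16}\,\frac{(\hat\sigma^2 - \tilde\sigma^2)^3}{\bar\sigma^7}.
\qquad (25)
\]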
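
The expansion in Eqn. (25) can be checked numerically. The sketch below is only an illustration under assumed inputs: the variance values are arbitrary, the function name is ours, and the remainder is bounded using the worst-case intermediate point (the smaller of the two variances) rather than the unknown $\bar\sigma^2$.

```python
import math


def inv_sigma_expansion(sig_hat2, sig_til2):
    """Compare 1/sigma_hat - 1/sigma_tilde with its second-order Taylor approximation.

    Returns (exact difference, two-term approximation, bound on the cubic remainder).
    """
    d = sig_hat2 - sig_til2
    sig_til = math.sqrt(sig_til2)
    # First two terms of Eqn. (25): -(1/2) d / sigma^3 + (3/8) d^2 / sigma^5.
    approx = -0.5 * d / sig_til ** 3 + 0.375 * d ** 2 / sig_til ** 5
    # The remainder is (5/16) |d|^3 / sigma_bar^7 with sigma_bar^2 between the two
    # variances, so it is largest when sigma_bar^2 is the smaller of the two.
    sig_bar_min = math.sqrt(min(sig_hat2, sig_til2))
    bound = (5.0 / 16.0) * abs(d) ** 3 / sig_bar_min ** 7
    exact = 1.0 / math.sqrt(sig_hat2) - 1.0 / sig_til
    return exact, approx, bound


# Arbitrary illustrative values of sigma_hat^2 around sigma_tilde^2 = 1.
for sig_hat2 in (0.8, 0.95, 1.0, 1.05, 1.3):
    exact, approx, bound = inv_sigma_expansion(sig_hat2, 1.0)
    print(sig_hat2, exact, approx, abs(exact - approx) <= bound + 1e-15)
```

The approximation error shrinks cubically in $\hat\sigma^2 - \tilde\sigma^2$, which is why the third-order term can be relegated to the remainder $U$.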


We thus focus on $\hat\sigma^2 - \tilde\sigma^2$. Recall the definition of $\hat\Psi = hR'W\hat\Sigma WR/n$. Then define the two
terms $A_1$ and $A_2$ through the following:

\[
\hat\sigma^2 - \tilde\sigma^2 = e_0'\Gamma^{-1}\left( \hat\Psi - \Psi \right)\Gamma^{-1}e_0 + e_0'\Gamma^{-1}\Psi\Gamma^{-1}e_0 - e_0'\tilde\Gamma^{-1}\tilde\Psi\tilde\Gamma^{-1}e_0 =: A_1 + A_2.
\qquad (26)
\]

² Technically, to obtain a $\tilde T$ with the desired properties, one need not expand $\hat\sigma^{-1} - \tilde\sigma^{-1}$ for the variance term:
that is, in Eqn. (24), $\tilde\sigma^{-1}s e_0'\Gamma^{-1}R'W(Y - M)/n$ and $(\hat\sigma^{-1} - \tilde\sigma^{-1})s e_0'\Gamma^{-1}R'W(Y - M)/n$ may be collapsed.
This requires strengthening Cramér's condition (see Remark 9), and since $\hat\sigma^{-1} - \tilde\sigma^{-1}$ must be accounted for
in the final bias term, $(\hat\sigma^{-1} - \tilde\sigma^{-1})s e_0'\Gamma^{-1}R'W(M - R\beta)/n$, there is little reason not to do both terms.


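For $A_1$, recall that $\hat\varepsilon_i = y_i - r_p(X_i - x)'\hat\beta_p$ and so

\[
\begin{aligned}
\hat\Psi - \Psi &= \frac{1}{nh}\sum_{i=1}^n (K^2 r_p r_p')(X_{h,i}) \left\{ \hat\varepsilon_i^2 - v(X_i) \right\} \\
&= \frac{1}{nh}\sum_{i=1}^n (K^2 r_p r_p')(X_{h,i}) \left\{ \left( y_i - r_p(X_i - x)'\hat\beta_p \right)^2 - v(X_i) \right\} \\
&= \frac{1}{nh}\sum_{i=1}^n (K^2 r_p r_p')(X_{h,i}) \left\{ \left( \varepsilon_i + [m(X_i) - r_p(X_i - x)'\beta_p] + r_p(X_i - x)'[\beta_p - \hat\beta_p] \right)^2 - v(X_i) \right\} \\
&=: A_{1,1} + A_{1,2} + A_{1,3} + A_{1,4} + A_{1,5} + A_{1,6} + A_{1,7} + A_{1,8},
\end{aligned}
\qquad (27)
\]

where

\[ A_{1,1} = \frac{1}{nh}\sum_{i=1}^n (K^2 r_p r_p')(X_{h,i}) \left\{ \varepsilon_i^2 - v(X_i) \right\}, \]

is due to the approximation of the (average over the) conditional variance by the squared residuals
(i.e. $A_{1,1}$ is the sole remainder that would arise if the true residuals were known and used
in place of $\hat\varepsilon_i$), and, using $r_p(X_i - x)'\hat\beta = r_p(X_i - x)'H_p^{-1}\Gamma^{-1}R'WY/n = r_p(X_{h,i})'\Gamma^{-1}R'WY/n$,
the terms $A_{1,k}$, $k = 2, 3, \ldots, 8$ are:

\[
\begin{aligned}
A_{1,2} &= \frac{1}{nh}\sum_{i=1}^n (K^2 r_p r_p')(X_{h,i}) \left\{ 2\varepsilon_i [m(X_i) - r_p(X_i - x)'\beta_p] \right\}, \\
A_{1,3} &= \frac{1}{nh}\sum_{i=1}^n (K^2 r_p r_p')(X_{h,i}) \left\{ -2\varepsilon_i r_p(X_{h,i})' \right\} \Gamma^{-1}R'W(Y - R\beta)/n, \\
A_{1,4} &= \frac{1}{nh}\sum_{i=1}^n (K^2 r_p r_p')(X_{h,i}) \left\{ -2[m(X_i) - r_p(X_i - x)'\beta_p] r_p(X_{h,i})' \right\} \Gamma^{-1}R'W(Y - M)/n, \\
A_{1,5} &= \frac{1}{nh}\sum_{i=1}^n (K^2 r_p r_p')(X_{h,i})\, r_p(X_{h,i})'\Gamma^{-1}R'W(Y - M)/n \left[ (Y - M)'/n + 2(M - R\beta)'/n \right] WR\Gamma^{-1}r_p(X_{h,i}), \\
A_{1,6} &= \frac{1}{nh}\sum_{i=1}^n (K^2 r_p r_p')(X_{h,i}) [m(X_i) - r_p(X_i - x)'\beta_p]^2, \\
A_{1,7} &= \frac{1}{nh}\sum_{i=1}^n (K^2 r_p r_p')(X_{h,i}) \left\{ -2[m(X_i) - r_p(X_i - x)'\beta_p] r_p(X_{h,i})' \right\} \Gamma^{-1}R'W(M - R\beta)/n,
\end{aligned}
\]

and

\[
A_{1,8} = \frac{1}{nh}\sum_{i=1}^n (K^2 r_p r_p')(X_{h,i}) \left[ r_p(X_{h,i})'\Gamma^{-1}R'W(M - R\beta)/n \right] \left[ (M - R\beta)'/n\, WR\Gamma^{-1}r_p(X_{h,i}) \right].
\]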

With this notation, we can write $A_1 = e_0'\Gamma^{-1}(\hat\Psi - \Psi)\Gamma^{-1}e_0 = e_0'\Gamma^{-1}\left( \sum_{k=1}^{8} A_{1,k} \right)\Gamma^{-1}e_0$. The
terms $A_{1,1}$ to $A_{1,5}$ will be incorporated into $\tilde T$: notice that these terms obey $A_{1,k} = A_{1,k}(\bar Z_{us})$
and $A_{1,k}(E[Z_{us}]) = 0$, and hence these properties will be inherited in the final two lines of Eqn.
(24). However, $A_{1,6}$, $A_{1,7}$, and $A_{1,8}$ do not have these properties, and will thus be incorporated
into $\tilde z$ and the remainder. Details are below.

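Turning to $A_2$ in Eqn. (26), using the identity $\Gamma^{-1} - \tilde\Gamma^{-1} = \tilde\Gamma^{-1}(\tilde\Gamma - \Gamma)\Gamma^{-1}$ and that $\Gamma$
and $\tilde\Gamma$ are symmetric, we find that

\[
\begin{aligned}
A_2 &= e_0'\Gamma^{-1}\Psi\Gamma^{-1}e_0 - e_0'\tilde\Gamma^{-1}\tilde\Psi\tilde\Gamma^{-1}e_0 \\
&= e_0'\Gamma^{-1}( \Psi - \tilde\Psi )\Gamma^{-1}e_0 + e_0'( \Gamma^{-1} - \tilde\Gamma^{-1} )\tilde\Psi\Gamma^{-1}e_0 + e_0'( \Gamma^{-1} - \tilde\Gamma^{-1} )\tilde\Psi\tilde\Gamma^{-1}e_0 \\
&= e_0'\Gamma^{-1}( \Psi - \tilde\Psi )\Gamma^{-1}e_0 - e_0'\tilde\Gamma^{-1}( \Gamma - \tilde\Gamma )\Gamma^{-1}\tilde\Psi( \Gamma^{-1} + \tilde\Gamma^{-1} )e_0.
\end{aligned}
\]

All of these terms obey the required properties of $\tilde T$.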
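
The rearrangement of $A_2$ is pure matrix algebra and can be verified on small symmetric matrices. The sketch below is an illustration only: the $2 \times 2$ matrices are arbitrary stand-ins (not computed from any data or kernel) and the helper names are ours.

```python
def mat_mul(A, B):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def mat_add(A, B, sign=1.0):
    """A + sign * B for 2x2 matrices."""
    return [[A[i][j] + sign * B[i][j] for j in range(2)] for i in range(2)]


def mat_inv(A):
    """Inverse of a 2x2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]


def chain(*Ms):
    """Left-to-right product of several 2x2 matrices."""
    out = Ms[0]
    for M in Ms[1:]:
        out = mat_mul(out, M)
    return out


def e0_quad(A):
    """e_0' A e_0, i.e. the top-left element."""
    return A[0][0]


# Arbitrary symmetric stand-ins for Gamma, Gamma-tilde, Psi, Psi-tilde.
G = [[1.0, 0.2], [0.2, 0.5]]
Gt = [[1.1, 0.1], [0.1, 0.6]]
P = [[0.8, 0.3], [0.3, 0.4]]
Pt = [[0.7, 0.25], [0.25, 0.45]]
Gi, Gti = mat_inv(G), mat_inv(Gt)

# Definition of A_2.
A2_direct = e0_quad(chain(Gi, P, Gi)) - e0_quad(chain(Gti, Pt, Gti))
# Final line of the display: uses symmetry of all four matrices.
A2_final = (e0_quad(chain(Gi, mat_add(P, Pt, -1.0), Gi))
            - e0_quad(chain(Gti, mat_add(G, Gt, -1.0), Gi, Pt, mat_add(Gi, Gti))))
print(A2_direct, A2_final)
```

The two numbers agree up to rounding; the agreement relies on symmetry, which is why the text invokes it.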

We now collect the terms from expanding $\hat\sigma^{-1} - \tilde\sigma^{-1}$ and return to Eqn. (24). Plugging
the terms $A_{1,1}$-$A_{1,8}$ and $A_2$ into the Taylor expansion in Eqn. (25), by way of Eqn. (26), and
collecting terms appropriately (i.e. those that belong in $\tilde T$ as described above), we have the
following, which picks up from Eqn. (24) and is a precursor to Eqn. (23):

\[ P[T_{us} < z] = P\left[ \tilde T(\bar Z_{us}) + U < \tilde z \right]. \qquad (28) \]


In this statement, we have made the following constructions: 


T = a- l se’ 0 T- l R'W(Y - M)/n 

+ a -1 se' 0 T~ l R'W (M - Rfi)/n - a^r) 


+ cr se 0 
1 

+ ' 


— I p — 1 _ p — 1 


R'W(M — R/3)/n 


2a 3 L 


oT 1 ( i A ljk J T 1 e 0 + A 2 + — [e'T 1 Ai jl T 1 e 0 + A 2 ] 


x \se' Q Y- 1 R!W{Y - M)/n + se' o r” 1 i? , W(M - R(3)/n \, 
2


u 


2d 3 


e' 0 r (Ai )6 + a 1i7 + a 1j8 ) r 


-1 


e “ + 8s! 


ejr 


-1 


V 8 A u 


r~ 


e 0 


x {se'^flWiY - M)/n + s^T^RfWiM - RP)/n} 

| — 1 (^i,6 + Aij + Aggj f 1 eo| rj, 


5 (a 2 - a 2 ) 3 
16 a 7 


and 


z = z — 



1 

2d 3 


e'r 


-l 



+ Aij 




V- 


In $U$ and $\tilde z$, each $\tilde A_{1,k}$ is $A_{1,k}$ where all elements have been replaced by their respective fixed-$n$
expected values, that is,


A lfi = E[A 1>6 ] = E h-\Kr p r'MX Ki ) [m(X,) - r p (X, - x)'j3 p \ 


Au = -2E h-\Kr p r'r'){X hii ) [m(X,) - r p (X, - x)'P p ] 


x f _1 E 


hr\Kr r )(X hJ ) [m(. Xj) - r r (X,-x)'p r ] 


and 


^1,8 — E 


h-\Kr p r' p ){X h ^ E 


h- l r p (X h Jt-\Kr p )(X h j) [m(^) 


r P (Xj 


x)'P P ] 



The next step in the proof is to show that, for $r_* = \max\{s^{-2}, \eta^2, h^{p+1}\}$ (i.e., the slowest
decaying), it holds that

\[ \frac{1}{r_*} P[|U| > r_n] \to 0, \quad \text{for some } r_n = o(r_*). \qquad (29) \]

This result is established by Lemma 7 in Section S.II.6.3 below. This, together with Eqn.
(28), implies Eqn. (23).

Under Assumption S.II.3.3, an Edgeworth expansion holds for $\tilde T$ up to $o(s^{-2} + s^{-1}\eta + \eta^2)$.
Thus, for a smooth function $G(z)$, we have $P[\tilde T < z] = G(z) + o(s^{-2} + s^{-1}\eta + \eta^2)$. Therefore,
a Taylor expansion gives 

P[T < z\ — G(z) — G^\z) /d 1 ~ 2 ^ 3 e o^ 1 (^i,6 + A-ij + Aqg^ T 1 eo|-+o(s " + s 1 r]+r] 2 ), 

which together with Eqn. (23) establishes the validity of the Edgeworth expansion. The terms
of the expansion are computed in Section S.II.6.4 below. □

S.II.6.2 Proof of Theorem 5(b) & (c)

To prove parts (b) and (c) of Theorem 5 the same steps are required, and so we will not pursue 
all the details here. Indeed, the same expansions are performed and the same bounds computed 
on objects which are conceptually similar, only taking into account the bias correction (in the 
numerator for (b), and also in the denominator for (c)). The bias correction will result in 
essentially two changes: first, many more terms like $\Gamma - \tilde\Gamma$ appear, and second, the bias
expressions and rates change. To illustrate, we will list several key points where these changes 
manifest. This list is not exhaustive, but it will show that the same methods used above still 
apply. 

First, for the numerator of $T_{bc}$ and $T_{rbc}$, recall that the estimator $\hat m$ is

\[ \hat m = e_0'\Gamma_p^{-1}R_p'W_pY/n, \]

while the bias corrected estimator is

\[ \hat m - \hat B_m = e_0'\Gamma_p^{-1}\left( R_p'W_p - \rho^{p+1}\Lambda_{p,1}e_{p+1}'\Gamma_q^{-1}R_q'W_q \right) Y/n. \]

Comparing these two expressions, it can be seen that the terms in the proof above that
involve $\Gamma_p - \tilde\Gamma_p$ will now additionally involve $\Gamma_q - \tilde\Gamma_q$ and $\Lambda_{p,1} - \tilde\Lambda_{p,1}$, whereas those with
$e_0'\tilde\Gamma_p^{-1}R_p'W_p$ will now have $e_0'\tilde\Gamma_p^{-1}\left( R_p'W_p - \rho^{p+1}\tilde\Lambda_{p,1}e_{p+1}'\tilde\Gamma_q^{-1}R_q'W_q \right)$ instead. To give a concrete
example, consider the third line of Eqn. (24),

\[ \tilde\sigma_{us}^{-1} s e_0'\left( \Gamma_p^{-1} - \tilde\Gamma_p^{-1} \right) R_p'W_p(M - R_p\beta_p)/n, \]

which becomes a piece of the function $\tilde T$. For part (b) of Theorem 5, treating $T_{bc}$, this will
become

(u 1 - fp) r' p w„(m - 

- se!y+' (ryve^.r- 1 - fyApjeyy- 1 ) R’ q W,(M - RM/n, 
and part (c) will have the same but with $\tilde\sigma_{rbc}^{-1}$. Then, since

\[
\begin{aligned}
\Gamma_p^{-1}\Lambda_{p,1}e_{p+1}'\Gamma_q^{-1} - \tilde\Gamma_p^{-1}\tilde\Lambda_{p,1}e_{p+1}'\tilde\Gamma_q^{-1} &= \left( \Gamma_p^{-1} - \tilde\Gamma_p^{-1} \right)\Lambda_{p,1}e_{p+1}'\Gamma_q^{-1} \\
&\quad + \tilde\Gamma_p^{-1}\left( \Lambda_{p,1} - \tilde\Lambda_{p,1} \right) e_{p+1}'\Gamma_q^{-1} + \tilde\Gamma_p^{-1}\tilde\Lambda_{p,1}e_{p+1}'\left( \Gamma_q^{-1} - \tilde\Gamma_q^{-1} \right),
\end{aligned}
\]

this term is handled identically, since the appropriate Cramér's condition is assumed.

Consider now the denominator of the Studentized statistics. For part (b), there is no
change as $\hat\sigma_{us}$ is still used, and so the terms involving $A_1$ and $A_2$ will be identical. However,
for $T_{rbc}$, we must account for changes of the above form, but also that the residuals are
estimated with the degree $q$ fit: $\hat\varepsilon_i = y_i - r_q(X_i - x)'\hat\beta_q$ instead of degree $p$. With these
changes in mind, the analogue of Eqn. (26) will be


a. 


rbc 


— G. 


rbc 


— e' F _1 

— c o L p 


(\ - %) + 



- 1 * r 

p ^ q L 


-l 

v 


/ T~' — 1 

e o - eoP 



(30) 


The second term will proceed as above, though T p — T p will be replaced by 


1 

F “ b = ^ E {hM&MXXi) - E &(*,)&( X>jv(X i ) 

2=1 


where f? bc (W) = (Kr p )(Xh,i) — (f +2 K p pT q 1 (Lr p )(pXh ti ) (cf. Section S.II.1.2, the function £ bc 
therein is f bc (W) = e' 0 T p 1 £ bc (X i )). To use similar notation, 


db 


i n 

b = ^E 

2=1 


XtfviXi) - E P„(X,)P„( XtfvlXi) 


Then, expanding (^(A 7 *) shows that Tg — is equal to 


1 

% - b + /> (P+1)+1 bT s _1 ^ E {(iviHWiMV) - E [(Lr,r' q )(X„MX,)] } f 

2=1 

1 n 

- p (p+1)+1 2— £ {(Kr r )(X h , i )(Lr'J(pX h _MXi) ~ E [(Kr r )(X h ^(Lr' q )(pX h MXi )\} rpAj.i, 


2=1 


and since all these terms still obey the appropriate Cramer’s condition, the same steps apply. 

The first term of Eqn. (30) will also follow by the same method as in the prior proof, 
but more care must be taken as many more terms will be present because — Tg consists 
of the following three terms, representing the variance of rh, the variance of B m) and their 
covariance, respectively: 


- Tg = hR' p W p (Eg - e) W p R p /n 

+ hp 2(p+l) A p jT- 1 (R! q W q £ q W q R^ T~ l X pl /n - hp 2 ^ A^f" 1 {R! q W q YW q R q ) f q l X p ,/n 
- 2hp P+1 R' p W p (SgWgtfgT^A^T - EWgi^gf;^^) /u. 

The first of these three is as in the prior proof, and yields the same $A_{1,1}$-$A_{1,8}$, only with the
bias of a $q$-degree fit: $m(X_i) - r_q(X_i - x)'\beta_q$. If we define



1 

1=1 


then the second term of — x l> q is equal to 

{A D i!r /,)w Li - »(*.-)} 

l i=i 

+ p‘+ 2 <f +1 ) (a, - A Pjl ) 

+ p 1 + 2 <- + 1 >a p ,, (r-‘ - f- 1 ) i'.r-'A,,! 

+ (r - 1 - f- 1 ) a p j 

+ p i+2(i>+i)A I , il f- 1 i a f- 1 (a r1 - v) . 


t _1 a 

q 


P, 1 


The first of these terms will also give rise to versions of $A_{1,1}$-$A_{1,8}$, only with the bias of a
$q$-degree fit and changing $K$ to $L$, $p$ to $q$, $h$ to $b$, etc., and will thus be treated exactly as above.
The rest of these are incorporated into $\tilde T_{rbc}$, similar to how $A_2$ is treated, because Cramér's
condition is satisfied. The third and final piece of $\hat\Psi_q - \Psi_q$ is equal to


- V +(p+1) { ^ £(^ p )(x w )(iy)(x w ri {4 - „(*,)} }. r-'Ay 


i =1 


2 pi+Cp+Dil (V 1 


A ki 

2p 1 +(p+ 1 )^f- 1 (a p> 1 -A p>1 V 


and thus is entirely analogous, with yet another version of $A_{1,1}$-$A_{1,8}$ defined for the remainder
in the first line, and the second two easily incorporated into $\tilde T_{rbc}$.

From these arguments, it is clear that the analogue of Lemma 7 will hold for these cases 
as well: the same fundamental pieces are involved, and thus the same arguments will apply, 
just as above. 


S.II.6.3 Lemmas 

Our proof of Theorem 5 relies on the following lemmas. The first gives generic results used 
to derive rate bounds on the probability of deviations of the necessary terms. Some such 
results are collected in Lemma 5. Lemma 7 shows how to use the previous results to establish
negligibility of the remainder terms required for Eqn. (29).

As above, we will generally omit the details required for Theorem 5 parts (b) and (c), to 
save space. These are entirely analogous, as can be seen from the steps in Lemma 5. Indeed, 
the first results are stated in terms of the kernel K and bandwidth h, but continue to hold 
for L and b under the obvious substitutions and appropriate assumptions. 

Throughout the proofs, $C$ shall be a generic conformable constant that may take different
values in different places. If more than one constant is needed, $C_1$, $C_2$, ..., will be used.

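Lemma 4. Let the conditions of Theorem 5 hold and let $g(\cdot)$ and $t(\cdot)$ be continuous scalar
functions.

(a) For some $\delta > 0$,

\[ s^2 P\left[ \left| s^{-2}\sum_{i=1}^n \left\{ (Kt)(X_{h,i})g(X_i) - E[(Kt)(X_{h,i})g(X_i)] \right\} \right| > \delta s^{-1}\log(s)^{1/2} \right] \to 0. \]

(b) For some $\delta > 0$,

\[ s^2 P\left[ \left| s^{-1}\sum_{i=1}^n \left\{ (Kt)(X_{h,i})g(X_i)\varepsilon_i \right\} \right| > \delta\log(s)^{1/2} \right] \to 0. \]

The same holds with $\varepsilon_i^2 - v(X_i)$ in place of $\varepsilon_i$, since it is conditionally mean zero and
has more than four moments.

(c) For any $\delta > 0$, an integer $k$, and any $\gamma > 0$,

\[ \frac{1}{h^{p+1}} P\left[ \left| s^{-2}\sum_{i=1}^n (Kt)(X_{h,i})g(X_i) \left[ m(X_i) - r_p(X_i - x)'\beta_p \right]^k \right| > \delta h^{(k-1)(p+1)}\log(s)^{\gamma} \right] \to 0. \]

(d) For any $\delta > 0$ and any $\gamma > 0$,

\[ s^2 P\left[ \left| s^{-2}\sum_{i=1}^n (Kt)(X_{h,i})g(X_i)\varepsilon_i \left[ m(X_i) - r_p(X_i - x)'\beta_p \right] \right| > \delta h^{p+1}\log(s)^{\gamma} \right] \to 0. \]

(e) For any $\delta > 0$, an integer $k$, and any $\gamma > 0$,

\[
s^2 P\left[ \left| s^{-2}\sum_{i=1}^n \left\{ (Kt)(X_{h,i})g(X_i)(m(X_i) - r_p(X_i - x)'\beta_p)^k - E\left[ (Kt)(X_{h,i})g(X_i)(m(X_i) - r_p(X_i - x)'\beta_p)^k \right] \right\} \right| > \delta h^{k(p+1)}\log(s)^{\gamma} \right] \to 0.
\]

Proof of Lemma 4(a). Because the kernel function has compact support and $t$ and $g$ are
continuous, we have

\[ \left| (Kt)(X_{h,i})g(X_i) - E[(Kt)(X_{h,i})g(X_i)] \right| < C_1. \]

Further, by a change of variables and using the assumptions on $f$, $g$ and $t$:

\[
\begin{aligned}
V[(Kt)(X_{h,i})g(X_i)] &\le E[(Kt)(X_{h,i})^2 g(X_i)^2] = \int f(X_i)(Kt)(X_{h,i})^2 g(X_i)^2\,dX_i \\
&= h\int f(x - uh)g(x - uh)^2 (Kt)(u)^2\,du \le C_2 h.
\end{aligned}
\]

Therefore, by Bernstein's inequality

\[
\begin{aligned}
s^2 P\left[ \left| s^{-2}\sum_{i=1}^n \left\{ (Kt)(X_{h,i})g(X_i) - E[(Kt)(X_{h,i})g(X_i)] \right\} \right| > \delta s^{-1}\log(s)^{1/2} \right]
&\le 2s^2\exp\left\{ -\frac{(s^4)(\delta s^{-1}\log(s)^{1/2})^2/2}{C_2 s^2 + C_1 s^2\,\delta s^{-1}\log(s)^{1/2}/3} \right\} \\
&= 2\exp\{2\log(s)\}\exp\left\{ -\frac{\delta^2\log(s)/2}{C_2 + C_1\delta s^{-1}\log(s)^{1/2}/3} \right\} \\
&= 2\exp\left\{ \log(s)\left[ 2 - \frac{\delta^2/2}{C_2 + C_1\delta s^{-1}\log(s)^{1/2}/3} \right] \right\},
\end{aligned}
\]

which vanishes for any $\delta$ large enough, as $s^{-1}\log(s)^{1/2} \to 0$.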

□

Proof of Lemma 4(b). For a sequence $r_n \to \infty$ to be given later, define


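\[ H_i = s^{-1}(Kt)(X_{h,i})g(X_i)\left( Y_i 1\{Y_i \le r_n\} - E[Y_i 1\{Y_i \le r_n\} \mid X_i] \right) \]

and

\[ T_i = s^{-1}(Kt)(X_{h,i})g(X_i)\left( Y_i 1\{Y_i > r_n\} - E[Y_i 1\{Y_i > r_n\} \mid X_i] \right). \]

By the conditions on $g(\cdot)$ and $t(\cdot)$ and the kernel function,

\[ |H_i| < C_1 s^{-1} r_n \]

and

\[
\begin{aligned}
V[H_i] &= s^{-2}V[(Kt)(X_{h,i})g(X_i)Y_i 1\{Y_i \le r_n\}] \le s^{-2}E\left[ (Kt)(X_{h,i})^2 g(X_i)^2 Y_i^2 1\{Y_i \le r_n\} \right] \\
&\le s^{-2}E\left[ (Kt)(X_{h,i})^2 g(X_i)^2 Y_i^2 \right] \\
&= s^{-2}\int (Kt)(X_{h,i})^2 g(X_i)^2 v(X_i) f(X_i)\,dX_i \\
&= s^{-2}h\int (Kt)(u)^2 (g^2 v f)(x - uh)\,du \\
&\le C_2/n.
\end{aligned}
\]

Therefore, by Bernstein's inequality

\[
\begin{aligned}
s^2 P\left[ \left| \sum_{i=1}^n H_i \right| > \delta\log(s)^{1/2} \right]
&\le 2s^2\exp\left\{ -\frac{\delta^2\log(s)/2}{C_2 + C_1 s^{-1}r_n\delta\log(s)^{1/2}/3} \right\} \\
&\le 2\exp\{2\log(s)\}\exp\left\{ -\frac{\delta^2\log(s)/2}{C_2 + C_1 s^{-1}r_n\delta\log(s)^{1/2}/3} \right\} \\
&\le 2\exp\left\{ \log(s)\left[ 2 - \frac{\delta^2/2}{C_2 + C_1 s^{-1}r_n\delta\log(s)^{1/2}/3} \right] \right\},
\end{aligned}
\]

which vanishes for $\delta$ large enough as long as $s^{-1}r_n\log(s)^{1/2}$ does not diverge.

Next, by Markov's inequality and the moment condition on $Y$ of Assumption S.II.3.1

\[
\begin{aligned}
s^2 P\left[ \left| \sum_{i=1}^n T_i \right| > \delta\log(s)^{1/2} \right]
&\le s^2\frac{1}{\delta^2\log(s)} E\left[ \left| \sum_{i=1}^n T_i \right|^2 \right] \\
&\le s^2\frac{1}{\delta^2\log(s)}\, nE[T_i^2] \\
&\le s^2\frac{1}{\delta^2\log(s)}\, nV\left[ s^{-1}(Kt)(X_{h,i})g(X_i)Y_i 1\{Y_i > r_n\} \right] \\
&\le s^2\frac{1}{\delta^2\log(s)}\, ns^{-2}E\left[ (Kt)(X_{h,i})^2 g(X_i)^2 Y_i^2 1\{Y_i > r_n\} \right] \\
&\le s^2\frac{1}{\delta^2\log(s)}\, ns^{-2}E\left[ (Kt)(X_{h,i})^2 g(X_i)^2 |Y_i|^{2+\xi} r_n^{-\xi} \right] \\
&\le s^2\frac{1}{\delta^2\log(s)}\, ns^{-2}\left( Chr_n^{-\xi} \right) \\
&\le \frac{Cs^2}{\delta^2\log(s)\,r_n^{\xi}},
\end{aligned}
\]

which vanishes if $s^2\log(s)^{-1}r_n^{-\xi} \to 0$.

It thus remains to choose $r_n$ such that $s^{-1}r_n\log(s)^{1/2}$ does not diverge and $s^2\log(s)^{-1}r_n^{-\xi} \to 0$.
This can be accomplished by setting $r_n = s^{\gamma}$ for any $2/\xi < \gamma < 1$, which is possible as
$\xi > 2$. □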
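
To see why "$\delta$ large enough" matters in the Bernstein step of Lemma 4(a), the bound can be evaluated directly. This is a rough numerical sketch: the constants `C1` and `C2` are set to 1 purely for illustration, and the function names are ours.

```python
import math


def bernstein_tail(t, var_sum, bound):
    """Bernstein bound on P[|sum of centered, bounded terms| > t].

    var_sum bounds the sum of variances; bound is the almost-sure bound on each term.
    """
    return 2.0 * math.exp(-(t ** 2 / 2.0) / (var_sum + bound * t / 3.0))


def scaled_bound(s, delta, C1=1.0, C2=1.0):
    """s^2 times the Bernstein bound in the setting of Lemma 4(a).

    The event is |sum| > s^2 * delta * s^{-1} * log(s)^{1/2}; the summands are
    bounded by C1 and the variances sum to at most C2 * s^2 (since n*h = s^2).
    """
    t = s ** 2 * delta * math.sqrt(math.log(s)) / s
    return s ** 2 * bernstein_tail(t, C2 * s ** 2, C1)


def closed_form(s, delta, C1=1.0, C2=1.0):
    """2 exp{ log(s) [ 2 - (delta^2/2) / (C2 + C1 delta s^{-1} log(s)^{1/2} / 3) ] }."""
    denom = C2 + C1 * delta * math.sqrt(math.log(s)) / (3.0 * s)
    return 2.0 * math.exp(math.log(s) * (2.0 - (delta ** 2 / 2.0) / denom))


for s in (10.0, 100.0, 1000.0):
    print(s, scaled_bound(s, delta=4.0), scaled_bound(s, delta=1.0))
```

With these illustrative constants, the bound shrinks as $s$ grows when $\delta = 4$ but grows when $\delta = 1$, matching the requirement that the bracketed exponent be negative.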
Proof of Lemma 4(c). By Markov's inequality


hP +l 


P 


s - 2 Y,{Kt){X h MXi) [m(Xi) - r p (X, - x)'P P ] k 


< 


< 


2=1 

1 1 


> 5h^ k ~ l)(p+1) log(s) 7 


hP +1 5/i( fc -!)(p+B log(s)T' 

1 


E 


to 1 ^ 


h-\Kt){X h ,f)g{Xi) [m(Xi) - r p {X t - x)'P P ] 


8h k{ v +v > log(s)'? 

= 0(log(s) -7 ) = o(l). 


h k{p+1) E 


h~ 1 {Kt)(X h)i )g(X i ) [h~ p ~ l (m{Xi) — r p (Xi — x)'j3 p )\ 


This relies on the following calculation, which uses the conditions placed on $m(\cdot)$:


E 


h- 1 ((. Kt){X K f)g{Xi)ei) M*i) - r p (X, - x)'(5 p ) k 
= h - 1 f (gfv)(Xf)(Kt)(X hti ) [miXj-rpiXi-xyp/dXi 

= hI (gfv)(X,)(Kt)(X hii ) ( (p + 1) , (V - x ) ) dx < 
= A‘ (P+1) A _1 J(gfv)(X,)(Kt)(X h]l ) ( ^ XS 1 ) dXi 
= Cfc l <” +1 >h" 1 J (gfv)(X i )(Kt)(X lh ,)X k h l f I) dX i 
= Ch kir+1> J (gfv)(x - uh)(Kt) (ii)u‘ <p+1) (iii 

x ]l k (P+ 1 )_ 


□ 


Proof of Lemma 4(d). By Markov's inequality, since $\varepsilon_i$ is conditionally mean zero, we have


s 2 P 


s- 2 ^{Kt^X^giXijei [miX,) - r p (X, - x)’P p 


2=1 


> 5h p+1 log(s) 7 


< ,P 


< 


5h 2 ^ log(s) 2 ^ s 2 
s 2 h 2( p +l) 


E 


h- 1 ((KtXXnMXjeiY [m{ A 7 ) - r p {X, - x)’(3 P } 2 


5s 2 h 2{ -P +1) log(s)^ 
log(s) -27 -> 0, 


E 


h- 1 {(Kt){X hi f)g(Xi)£if [h-^imiXi) - r p {X, - x)^ p )\ 


where we rely on the same argument as above to compute the bias rate. □

Proof of Lemma 4(e). Follows from identical steps to 4(d). □

To illustrate how the above Lemma is used for the objects under study, we present the 
following collection of results. This is not meant to be an exhaustive list of all such results 
needed to prove all parts of Theorem 5, but any and all omitted terms follow by identical 
reasoning. 

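Lemma 5. Let the conditions of Theorem 5 hold.

(a) For some $\delta > 0$, $r_*^{-1}P[|\Gamma_p - \tilde\Gamma_p| > \delta s^{-1}\log(s)^{1/2}] \to 0$. Consequently, there exists a
constant $C_\Gamma < \infty$ such that $P[\Gamma_p^{-1} > 2C_\Gamma] = o(s^{-2})$ and so the prior rate result holds for
$|\Gamma_p^{-1} - \tilde\Gamma_p^{-1}|$ as well. Finally, these same results hold for $\Gamma_q$ as well.

(b) For some $\delta > 0$, $r_*^{-1}P[|\Lambda_{p,1} - \tilde\Lambda_{p,1}| > \delta s^{-1}\log(s)^{1/2}] \to 0$.

(c) For some $\delta > 0$,

\[ s^2 P\left[ \left| s^{-1}\sum_{i=1}^n \left\{ (Kr_p)(X_{h,i})\varepsilon_i \right\} \right| > \delta\log(s)^{1/2} \right] \to 0. \]

(d) For any $\delta > 0$ and $\gamma > 0$,

\[ \frac{1}{h^{p+1}} P\left[ \left| s^{-2}\sum_{i=1}^n \left\{ (Kr_p)(X_{h,i})[m(X_i) - r_p(X_i - x)'\beta_p] \right\} \right| > \delta\log(s)^{\gamma} \right] \to 0. \]

(e) There is some constant $C_\Psi$, such that $P[\Psi_p > 2C_\Psi] = o(s^{-2})$.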

Proof of Lemma 5(a). A typical element of $\Gamma_p - \tilde\Gamma_p$ is, for some integer $k \le 2p$,

\[ \frac{1}{nh}\sum_{i=1}^n \left\{ K(X_{h,i})X_{h,i}^k - E\left[ K(X_{h,i})X_{h,i}^k \right] \right\}. \]

Therefore, the result follows by applying Lemma 4(a) to each element. Next, note that under
the maintained assumptions

\[ \tilde\Gamma_p = E\left[ h^{-1}(Kr_pr_p')(X_{h,i}) \right] = h^{-1}\int (Kr_pr_p')(X_{h,i})f(X_i)\,dX_i = \int (Kr_pr_p')(u)f(x - uh)\,du \]

is bounded away from zero and infinity for $n$ large enough. Therefore, there is a $C_\Gamma < \infty$ such
that $|\tilde\Gamma_p^{-1}| < C_\Gamma$ and then

\[
\begin{aligned}
P\left[ \Gamma_p^{-1} > 2C_\Gamma \right] &= P\left[ \Gamma_p^{-1} - \tilde\Gamma_p^{-1} + \tilde\Gamma_p^{-1} > 2C_\Gamma \right] \\
&\le P\left[ \left| \Gamma_p^{-1} - \tilde\Gamma_p^{-1} \right| > s^{-1}\log(s)^{1/2} \right] + P\left[ \tilde\Gamma_p^{-1} > 2C_\Gamma - s^{-1}\log(s)^{1/2} \right].
\end{aligned}
\]

The third result follows from these two and the identity $\Gamma_p^{-1} - \tilde\Gamma_p^{-1} = \tilde\Gamma_p^{-1}(\tilde\Gamma_p - \Gamma_p)\Gamma_p^{-1}$.

Finally, for $\Gamma_q$, the identical steps apply with $L$, $q$, and $b$ in place of $K$, $p$, and $h$. □

Proof of Lemma 5(b). Follows from identical steps to the previous result. □ 

Proof of Lemma 5(c). Follows from identical steps, but using Lemma 4(b) in place of Lemma
4(a). □

Proof of Lemma 5(d). Follows from identical steps, but using Lemma 4(c) in place of Lemma 
4(a). □ 

Proof of Lemma 5(e). A typical element of $\Psi_p$ is


1 

nh 


2—1 


and hence under the maintained assumptions the result follows just as the comparable result
on $\Gamma_p$. □

We next state, without proof, the following fact about the rates appearing in all these 
Lemmas, which follows from elementary inequalities. 

Lemma 6. If $r_1 = O(r_1')$ and $r_2 = O(r_2')$, for sequences of positive numbers $r_1$, $r_1'$, $r_2$, and $r_2'$,
and if a sequence of nonnegative random variables obeys $(r_1)^{-1}P[U_n > r_2] \to 0$, it also holds
that $(r_1')^{-1}P[U_n > r_2'] \to 0$.

In particular, since $r_* = \max\{s^{-2}, \eta^2, s^{-1}\eta\}$ is defined as the slowest vanishing of the rates,
then $r_1^{-1}P[|U'| > r_n] = o(1)$ implies $r_*^{-1}P[|U'| > r_n] = o(1)$, for $r_1$ equal to any of $s^{-2}$, $\eta^2$, or
$s^{-1}\eta$. Similarly, $r_n$ may be chosen as any sequence that obeys $r_n = o(r_*)$. Thus, for different
pieces of $U$ defined in Eqn. (29), we may make different choices for these two sequences, as
convenient.


The next Lemma proves Eqn. (29), a crucial step in the proof of Theorem 5(a). Because 
this result only involves undersmoothing, we will omit the subscript p as above. 

Lemma 7. Let the conditions of Theorem 5(a) hold. Then Eqn. (29) holds, namely, for some
$r_n = o(r_*)$,

\[ \frac{1}{r_*} P[|U| > r_n] \to 0. \]

Proof. Recall the definition:


U 


1 3 

~2fr3 e qF 1 (+1,6 + +1,7 + +1,8) r 1 eo + 


e'r - 1 


E 8 


k =2 


A 


l,fc 


r-'eo 


n 2 


5 (a 2 — a 2 ) 3 
16 a 7 


x |se[ ) r~ 1 i? , lR(y' - M)/n + se {) Y- l R'W(M - Rp)/n } 

1 (+ 1,6 + +1,7 + + 1,8 ) f 1 e 0 | T], 


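To fully prove the claim of the lemma, we must fully expand $U$ and bound each piece. First,
we present complete details on two terms. The remainder are entirely analogous, as discussed
below. Consider the pieces involving $A_{1,6}$, namely:

\[ e_0'\Gamma^{-1}A_{1,6}\Gamma^{-1}e_0\left\{ s e_0'\Gamma^{-1}R'W(Y - M)/n + s e_0'\Gamma^{-1}R'W(M - R\beta)/n \right\} - e_0'\tilde\Gamma^{-1}\tilde A_{1,6}\tilde\Gamma^{-1}e_0\,\eta. \]

The first of these is

\[
\begin{aligned}
e_0'\Gamma^{-1}A_{1,6}\Gamma^{-1}e_0\, s e_0'\Gamma^{-1}R'W(Y - M)/n &= e_0'\Gamma^{-1}\left( A_{1,6} - \tilde A_{1,6} \right)\Gamma^{-1}e_0\, s e_0'\Gamma^{-1}R'W(Y - M)/n \\
&\quad + e_0'\left( \Gamma^{-1} - \tilde\Gamma^{-1} \right)\tilde A_{1,6}\Gamma^{-1}e_0\, s e_0'\Gamma^{-1}R'W(Y - M)/n \\
&\quad + e_0'\tilde\Gamma^{-1}\tilde A_{1,6}\left( \Gamma^{-1} - \tilde\Gamma^{-1} \right) e_0\, s e_0'\Gamma^{-1}R'W(Y - M)/n \\
&\quad + e_0'\tilde\Gamma^{-1}\tilde A_{1,6}\tilde\Gamma^{-1}e_0\, s e_0'\left( \Gamma^{-1} - \tilde\Gamma^{-1} \right) R'W(Y - M)/n \\
&\quad + e_0'\tilde\Gamma^{-1}\tilde A_{1,6}\tilde\Gamma^{-1}e_0\, s e_0'\tilde\Gamma^{-1}R'W(Y - M)/n \\
&=: U_{1,1} + U_{1,2} + U_{1,3} + U_{1,4} + U_{1,5}.
\end{aligned}
\]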

We now bound each remainder in turn. First, for r n = h p+1 log(s) + 2 , we have 


s 2 P [|C 1;1 | > r n ] = s 2 P | le'or - 1 \A lfi - + 1)6 J r- 1 e oSe / o r- 1 J R / W(y' - M)/n 
< s 2 P | 8 Cp +i, 6 -+i ,6 > log0) _ 1 / 2 r\ 


> r r 


s 2 P 


- 1 V {(*>,)(*,)£,} 


i= 1 


> log(s ) 1/2 


= s 2 P 

= 0 ( 1 ), 


8 Cr 


+i,6 — +i,6 


> K 2{ - p+l) log(s ) 7 


JR(p+Y log(s ) 1 / 2 " 1 " 7 


+ s 2 3P [r ; 1 > 2C r ] 
+ 0 ( 1 ) 


because h 2 + + 1 V n log(s) 1,/2 1 — h + +1 ) log(s) 1 7 —* oo. 
Next, since x h 2<J>+l \ for r n = h p+l log(s) 1 / 2 . 


s 2 P [|f/i >2 | > r n ] = s 2 P e ' 0 (r- 1 - f" 1 ) A lfi T~ l e Q se' Q Y- l R'W(Y - M)/n 


> ry 


< s 2 P 


4 Cl 


A 


1,6 


^{(Kr.XX^e,} 


2—1 


> s log(s) + 2 rv, 


s 2 P 


T " 1 - f - 1 > s " 1 log(s) 1/2 l + s 2 2P [r ; 1 > 2 Cr 


= s 2 P 

= o(l), 


4 C 2 


- 1 Y.^ Kr M X Ki) £ i) 


2—1 


> log(s ) 1/2 


S'fr, 


h 2 (p+Y log(s) 


+ o(l) 


because sr n h~ 2( ' p+1 ' > log(s ) -1 = sh~^ p+l ' > log(s)” 3 + —» oo. Terms U\^ and t/ 1.4 are nearly 
identically treated. 

Let r n = k p+1 log(s)~ 1//2 . Then since x /i 2 + +1 ), 


s 2 P [|f/ 1>5 | > r n ] = s 2 P e / 0 r^M li 6 r^ 1 e 0 se / 0 r- 1 i? / lT(y' - M)/n 


> T r 


< s 2 P 


< s 2 P 


<4 


Cr 


A 


- 1 'Zu Kr M x ^ 

i=1 

s-^iiKtXXhMXi)*} 


> T r. 


2=1 


> 


log( a ) 1/2 l ° S(a) ~ r " 

gl J h 2(p+ 1) 


= o(l), 


because h 2(p+1) r„ log(s) + 2 =/r + +1) log(s) 1 —> 00 . 

Thus, since a^ 1 is bounded away from zero, we find that 


s 2 P | |4ye / o r- 1 v4 1 , 6 r” 1 e 0 se( ) r- 1 i? / fT(y' - M)/n 
2a 6 


> 


->■ 0. 


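Turning our attention to the second term, we have

\[
\begin{aligned}
& e_0'\Gamma^{-1}A_{1,6}\Gamma^{-1}e_0\, s e_0'\Gamma^{-1}R'W(M - R\beta)/n - e_0'\tilde\Gamma^{-1}\tilde A_{1,6}\tilde\Gamma^{-1}e_0\,\eta \\
&\quad = e_0'\Gamma^{-1}\left( A_{1,6} - \tilde A_{1,6} \right)\Gamma^{-1}e_0\, s e_0'\Gamma^{-1}R'W(M - R\beta)/n \\
&\qquad + e_0'\Gamma^{-1}\tilde A_{1,6}\Gamma^{-1}e_0\, s e_0'\Gamma^{-1}\left( R'W(M - R\beta)/n - E[R'W(M - R\beta)/n] \right) \\
&\qquad + e_0'\left( \Gamma^{-1} - \tilde\Gamma^{-1} \right)\tilde A_{1,6}\Gamma^{-1}e_0\, s e_0'\Gamma^{-1}E[R'W(M - R\beta)/n] \\
&\qquad + e_0'\tilde\Gamma^{-1}\tilde A_{1,6}\left( \Gamma^{-1} - \tilde\Gamma^{-1} \right) e_0\, s e_0'\Gamma^{-1}E[R'W(M - R\beta)/n] \\
&\qquad + e_0'\tilde\Gamma^{-1}\tilde A_{1,6}\tilde\Gamma^{-1}e_0\, s e_0'\left( \Gamma^{-1} - \tilde\Gamma^{-1} \right) E[R'W(M - R\beta)/n] \\
&\quad =: U_{2,1} + U_{2,2} + U_{2,3} + U_{2,4} + U_{2,5}.
\end{aligned}
\]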

For r n = h p+1 log(s)^ 1 , we have 

r.- 1 P||£/ 2 ,i| > r n ] = r"‘P [e' 0 r-‘ (-4,, e - -4i, e ) T-'e o se! 0 r- 1 I(W(M - R0)/n > r„ 

< r,-¥ [sCfla |-4,, 6 - ii,| > sh^'> log^^ 

r 1 n 

+ r~ l F — Y {( Kr p)(X h ,i) [m(Xi) - r p (Xi - x)'/3 p }} > log(s) 7 

i=1 

+ r~ 1 3F [r; 1 >2C r \ 

< s 2 p [s ck |-4 I(6 - ii, 6 | > s /, 2( >- +i ' i°g(«) % ;i2 , P+ i)'; og(s)2 ^ 

r n 

+ ^- (p+1) p ^{(iPr p )(X^)[m(^)-r p (X^x)%]} > log(s)' 


+ s 2 3P [r; 1 > 2 C r ] 

= o(l), 

because $s h^{2(p+1)} r_n^{-1} \log(s)^{2\gamma} = s h^{p+1} \log(s)^{1+2\gamma} \to 0$ by the conditions on $\eta$ placed in the
theorem.

Next, with $r_n = h^{p+1} \log(s)^{-1}$ and using $\tilde A_{1,6} \asymp h^{2(p+1)}$, we have
r“ 1 P [| C/2,2 1 > r„] = r“ 1 P [ e'or-^i.gr^eose'or- 1 (R'W(M - R( 3 )/n - E [A'VF(M - £/ 3 )/n]) > r, 

n 

<r- x P 8Cp |-A 1)6 | S - 1 ^{(A%)(X M )[m(A 7 )- r p (A 7 -a;) / /3 p ] 

- E [(A> P )(X M ) [m(Aj) - r p (X, - *)%]]} > r n 
+ r" 1 3P [r" 1 > 2Cr] 

r n 

< s 2 P 8Cp S - 2 Y{( Kr p)( X h,i) [mix,) - r p ( A 7 - x)'/ 3 p ] 

2=1 

- E {(Kr r )(X hti ) [m(Xi) - r„(X, - x)'p r ]\) > ftX log(«) \ 3(p+1 ^ ( 

+ s 2 3P [r; 1 > 2C r ] 

= o(1),
because r n h 3 ( p+1) log(s) 7 = h 2 ( p+1 ) log(s) 1 7 —* oo.

Third, as $\tilde A_{1,6} \asymp h^{2(p+1)}$ and $\mathbb{E}[R'W(M - R\beta)/n] \asymp h^{p+1}$, if we choose $r_n = h^{p+1} \log(s)^{-1}$,


r * p [|^2,3 1 > r n] < r* P 


4C 2 s 


'' — l p —1 


>s 1 log(s ) 1//2 


sr r, 


h 3 (p +1 ) log(s ) 1 / 2 


+ r“ 1 2 P [T ; 1 > 2 C r ] 


< s 2 P 


4 Cl 


r-i-f " 1 > s - 1 iog(s ) 1/2 


h 3 (p+ 1 ) log(s ) 1 / 2 


+ s 2 2P [r; 1 > 2C r ] 
= o(l), 


because $r_n h^{-3(p+1)} \log(s)^{-1/2} = h^{-2(p+1)} \log(s)^{-1-1/2} \to \infty$. The terms $U_{2,3}$ and $U_{2,5}$ are handled
identically.

Thus, since d -1 is bounded away from zero, we find that 


s 2 P 


—e'r- 1 A 1 , 6 r- 1 e 0 se'r- 1 i? / fT(M - i?/3)/n - e'f^i^f 


£oV 


> T r 


->• 0 . 


The same type of arguments, though notationally more challenging, will show that the
remainder of $U$ obeys the same bounds. Note that the rest of the terms are even higher order,
involving either $A_{1,7}$ and $A_{1,8}$, or the square or cube of the other errors. It is for this reason
that only the “leading” three terms need be centered, that is, why only

_ 2 Cb* e °^ 1 (^1,6 + ^1,7 + Ai $j f 1 eol r] 


appears in z. 


□ 


S.II.6.4 Computing the Terms of the Expansion 

Identifying the terms of the expansion is a matter of straightforward, if tedious, calculation. 
The first four cumulants of the Studentized statistics must be calculated (due to James and 
Mayne (1962)), which are functions of the first four moments. In what follows, we give a 
short summary. Note well that we always discard higher-order terms for brevity, and to save 
notation we will write $\stackrel{o}{=}$ to stand in for “equal up to $o((nh)^{-1} + (nh)^{-1/2}\eta + \eta^2)$”, and including
$o(\rho^{1+2(p+1)})$ for $T_{bc}$.

The computations will be aided by putting all three estimators into a common structure. 
In close parallel to the density case, let us define $\hat m_1 := \hat m$ and $\hat m_2 := \hat m - \hat B_m$, $\hat\sigma_1^2 := \hat\sigma_{us}^2$,
and $\hat\sigma_2^2 := \hat\sigma_{rbc}^2$, so that subscripts 1 and 2 generically stand in for undersmoothing and bias
correction, respectively. With this in mind, we write


$$T_{us} = T_{1,1}, \qquad T_{bc} = T_{2,1}, \qquad \text{and} \qquad T_{rbc} = T_{2,2},$$

again paralleling the density case, so that the first subscript refers to the numerator and the
second to the denominator. In the same vein, with some abuse of notation, we will also use³
$r_1(u) = r_p(u)$, $r_2(u) = r_q(u)$, $K_1(u) = K(u)$, $K_2(u) = L(u)$, $h_1 = h$, and $h_2 = b$, as well as

=era, 

e\ix„x 1 ) = e 1 jx„x J ) ] 

4(Xt) = era, 

l\( A'„V)=C(A'„A',). 


For the purpose of computing the expansion terms (i.e. moments of the two sides agree 
up to the requisite order), recalling the Taylor series expansion above, we will use 


T ps 

L V.W ~ 


1 — (Wui, 1 + Kd,1 + Vw,2) + (W w ,i + V W} i + V Wj2 ) 


a w {E v ,i + E v ,2 + E v $ + B V: 1 } , 


where we define, for v E {1,2}, 




2=1 


E v o — s 




hih) 2 ^ ^ 
v > i= 1 7=1 


n n n 


^v,3 • & 




(n/r ) 3 

v ’ i= 1 7=1 fc=l 


where the final line defines £^ s (Xi, Xj, Xk) in the obvious way following £{ s . To concretize the 
notation, for undersmoothing we are defining 


£1,1 = s<{ l rp l «;,ii.,iv - M)/n, 

Elf = sej.f-yfp - T„)t- l R' p W r (Y - M)/n, 

Eif = se' 0 t-\t p - r p )f-‘(f p - T„)t^R' p W p (Y - M)/n. 

³ Throughout Section S.II, we use only generic polynomial orders $p$ and $q$, and so this notation will not
conflict with the local linear or local quadratic fits, which would also be denoted $r_1(u)$ and $r_2(u)$, respectively.

In a similar way,


n .. n n 

in,,, = ^ E (W («? - «(*<))} - 2-jp E E {«(^) 2 r„(X t „, i )'f;'(^r„)(X k „ i ) £j ^ 

2=1 2=1 j=l 


n n n 


+ 


n 3 h 3 


E E E {C(X,)\(X h Jf-\K v r v )(X h ,,)E i 


£k r ? 


2=1 j=l /c=l 


1 n 1 n n 

U, = -T E - EK(Xi)V X,) 2 ]} + 2-j- E E 4(*i> 


2=1 


^ n n n ^ 

vi, 2 = 4,EEE ^(X,, X,X(X,, X fc )n(X,) + 2-^ EEE ^ (Xj, Xj , Xfe) £° (Xj) v (Xj), 

i=l i=l fc=l 


n 3 h 3 


i= 1 j=l fc=l 

and specifically for undersmoothing and bias correction, let 


n 2 h 2 

1 

n 3 h 3 


i= i j=i 

n n n 


B 1,1 = S— Y J ^( x i)Vn < yX i ) - r p (Xi - x)'(3 P ] 


i= 1 


and 


-i ™ 

£ 2,1 = Ej^WMXO - r p+ \(Xj - z)'/? p+1 ] 

- ft -1 ftW) - CM) [™M - r,(X, - x)' 

Note that $\eta_{us} = \mathbb{E}[B_{1,1}]$ and $\eta_{bc} = \mathbb{E}[B_{2,1}]$.

Straightforward moment calculations yield 


EfT^] = aJ-E [B Vt 


25-2 


■E \W w ,\E Vi \\ , 


— ^ryE [^v,i + ^,2 + 2 E Vt iE Vt 2 + 2£7 Vj 1 £7 Vj3 ] 

® w 

— -ryE \ { W w ^\E‘1 1 + + V Wt 2E 2 vl + 2K ) ,i^,i^ )2 ] 


+ -t-E [w 2 ^ 2 1 + F „ 2 ,] + ^E [B 2 ,] - 2;-E [lU.A.A,,], 


(7,1 


(T„, 


*W,J A ^E [£?,,] - 4 t E [»W<i] + JrE [S 2 ^.,,] , 


25" 5 

w 


(X„,
and


ERJ = ^E [E^ + 4 E^E^ + 4 E* A E vfl + 6 E^E^} 

- ^E [W wA E^ + V Wtl E* fl + 4 V w>1 E* tl E Vt2 + V w , 2 E vA ] 

® w 

+ Jr E KXi + KKA 

+ 4-e [sis„,i] - Ee [W^E^B^] + Ae KX] 


Computing each term in turn, we have 


IE Tjy , 

E [W^iK.l] = S-'E [/T 1 ^ (X^°(X;) £ 3 ] , 

E KJ = d 2 , 

E [^,1^,2] = s- 2 E [/i _1 ^(Xj, Xj)^(Xj)£ 2 ] , 
[/T^X^X,-) 2 ^], 

E [^, 2 ^, 3 ] = s“ 2 E [/^ _2 ^ 2 (Xj, Xj, Xj)£°(Xj)£ 2 ] , 

E KXJ = ^ 2 (e [ft-V“ (A',) 2 ^ (Jf,) 2 (4 - 4A'i) 2 )] 


- 25\ 2 E 


/ i - 1 C(^)>-(X^, i )T- i (X w r U) )(X ^ )i )£ 2 


/7- 1 £°(X 4 ) 2 £°(X i ) 2 r„,(X^,) , r.- 1 4 2 E[/ i - 1 (X w r U) )(X, i „, i )^(X i )£ 2 ] 


— 4E 

d 2 E I /r 2 £° (X,) 2 ( r w (X hiu i )'r~ 1 (K w r w )(X hiu j )) e 


+ E 


»‘W (, E [A-■V p (A' t , i )T;‘(A> p )(A' ftJ )C(A i )4|A 

E [K, il Sj 1 ] S s -2|e [/r 1 (<» (A,) 2 t>(A,) - ¥.[f° w (X i ) 2 v(X i )\) 4(A,) 2 £ 2 ] 

+ 2ft 2 E [h-HUX„ X,)C(XMXi )}}, 

E [14,, 2 ] A s - 2 {e [ft- 2 (4(A"j) 2 i'(Aj) - E[^ u (X j ) 2 v(X j )]) (KXt.xMXt )ef] 
+ 2E [ft-^fAi, AX(A' t , AX,(A',)4(A4MA',)4]}, 

E MJ A s - 2 {ft 2 E [ ft " 2 (4(A,, A^) 2 + 2C(A'„ A'i, Aj)) t>(A,)] }, 

E [WjX] A ,-*{^E [ft-'OAi) 4 (4 - t.(A,) 2 )] + 2E [h^^X^X.felf), 



E Ki<i] = 


s~ 2 a 2 J E 


E \W w ,iE Vt iB V} i] 

E Ki] 
E [H4,i£, 3 J 

E [<J 

E 

E [-^u,1-^,3] 

E [E 2 v>l El 2 ] 
E [W^XJ 
E [14,i<J 

E [Vw^E^^Ey^] 
E [14,2^ 4 ,i] 
E [W£XJ 
E Ki<i] 


/i- 1 (CMMa,) - E[C(A44(X,)])‘ 

+ 4E [/r 2 (CixjMXi) - E [Cix^MXi)]) e w (x j: xMXjMXjjl 

+ 4E [h-HKXt, Xs)^(X,)v(XiK(Xt, XMX.MXt )]}, 

E [W Wt iE Vt i] E [B V: i ], 
s _1 E [h-XiX^ef] , 

E [Ev,i\ E [W Wj iE Vj i] , 

35" 4 + s _ 2 E [rt°(A4 4 4] , 
s- 2 6a 2 E [h-^KX^XMXje 2 ], 

S - 2 3a 2 E [/r 2 £ 2 (A i; X„X,)^(A4e 2 ] , 

s- 2 4 e [*>“ 2 4(^. -V) 2 * 2 ?] + 2 E [h-HK x„ a'X(W, XMXM x t ) c y t ]}, 

«-*{e {h-'t(Xi)X(Xi)4] E [h-'<;(X,)*eJ) + 6E {El,} E [W^E 2 ,] }, 
S - 2 s26{e [A- 1 (t(X,) 2 v(X,) - E \el(X,fv(X,)}) C(X,) 2 el] 

+ 2E [h~X(Xi, X^XMXjf^viX,)] + E [h-'t (X„ X,)e°, (XMXi )]}, 
3E [E 2 ,] E If 2 ], 

3E [E 2 ,] E [V w , 2 E 2 ,] , 

3E [El ,] E [Wl,El,} , 

3E [E 2 ,] E KjE 2 ,] . 


The expansion now follows, formally, from the following steps. First, combining the above 
moments into cumulants. Second, these cumulants may be simplified using that 

?f = i + i(w^v) (p 1+(p+1) n i, bc + p 1 + 2 (p+ 1 ) fi 2 ,bc) 

® w 

and that in all cases present products such as £°(A \) kl l° v (Xi) k2 and ^(Xi, Xj) kl £l(Xi, Xj) k2 
may be replaced with £®(Xi) kl+k2 and £ 4 (Aj, Xj) kl+k2 , respectively, provided the arguments 
match. This is immediate for $v = w$, and for $v \neq w$, follows because $\rho \to 0$ is assumed. This
is the analogous step to Eqn. (16) in the density case. For any term of a cumulant with a rate
of $(nh)^{-1}$, $(nh)^{-1/2}\eta_v$, $\eta_v^2$, or $\rho^{1+2(p+1)}$ (i.e., the extent of the expansion), these simplifications
may be inserted as the remainder will be negligible. Third, with the cumulants in hand, the
terms of the expansion are determined as described by, e.g., Hall (1992a, Chapter 2).

S.II.7 Complete Simulation Results


In this section we present the results of a simulation study addressing the finite-sample
performance of the methods described in the main paper. As with the density estimator, we
report empirical coverage probabilities and average interval length of nominal 95% confidence
intervals for different estimators of a regression function $m(x)$ evaluated at the values
$x \in \{-2/3, -1/3, 0, 1/3, 2/3\}$. For each replication, the data are generated as i.i.d. draws,
$i = 1, 2, \dots, n$, with $n = 500$, as follows:


$$Y = m(x) + \varepsilon, \qquad x \sim U[-1, 1], \qquad \varepsilon \sim N(0, 1).$$


Model 1: $m(x) = \sin(4x) + 2\exp\{-64x^2\}$

Model 2: $m(x) = 2x + 2\exp\{-64x^2\}$

Model 3: $m(x) = 0.3\exp\{-4(2x + 1)^2\} + 0.7\exp\{-16(2x - 1)^2\}$

Model 4: $m(x) = x + 5\phi(10x)$

Model 5: $m(x) = \dfrac{\sin(3\pi x/2)}{1 + 18x^2[\mathrm{sgn}(x) + 1]}$

Model 6: $m(x) = \dfrac{\sin(\pi x/2)}{1 + 2x^2[\mathrm{sgn}(x) + 1]}$


Models 1 to 3 were used by Fan and Gijbels (1996) and Cattaneo and Farrell (2013), while 
Models 4 to 6 are from Hall and Horowitz (2013), with some originally studied by Berry et al. 
(2002). The regression functions are plotted in Figure S.II.1 together with the evaluation
points used.
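As a rough illustration of the data-generating process just described, the following Python sketch draws one simulated sample. The names `phi`, `MODELS`, and `simulate` are illustrative and are not taken from the authors' software. The regression functions are transcribed from the model list, and reading the second term of Model 4 as $5\phi(10x)$, with $\phi$ the standard normal density, is an assumption.

```python
import numpy as np


def phi(u):
    """Standard normal density (assumed meaning of phi in Model 4)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)


# Regression functions m(x) for Models 1-6, transcribed from the model list.
MODELS = {
    1: lambda x: np.sin(4 * x) + 2 * np.exp(-64 * x**2),
    2: lambda x: 2 * x + 2 * np.exp(-64 * x**2),
    3: lambda x: (0.3 * np.exp(-4 * (2 * x + 1) ** 2)
                  + 0.7 * np.exp(-16 * (2 * x - 1) ** 2)),
    4: lambda x: x + 5 * phi(10 * x),
    5: lambda x: np.sin(3 * np.pi * x / 2) / (1 + 18 * x**2 * (np.sign(x) + 1)),
    6: lambda x: np.sin(np.pi * x / 2) / (1 + 2 * x**2 * (np.sign(x) + 1)),
}


def simulate(model, n=500, rng=None):
    """Draw (x, y) with x ~ U[-1, 1], eps ~ N(0, 1), y = m(x) + eps."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.uniform(-1.0, 1.0, size=n)
    eps = rng.standard_normal(n)
    return x, MODELS[model](x) + eps
```

A call such as `simulate(1, rng=np.random.default_rng(0))` returns one sample of size 500 from Model 1; repeating it across replications gives the Monte Carlo draws.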

We compute confidence intervals for $m(x)$ using five alternative approaches:


US: local-linear estimator using a conventional approach based on undersmoothing ($I_{us}$).

Locfit: local-linear estimator computed using default options in the R package locfit (see
Loader (2013) for implementation details).


BC: traditional bias-corrected estimator using a local-linear estimator with local-quadratic
bias correction and $\rho = 1$ ($I_{bc}$).


HH: local-linear estimator using the bootstrapped confidence bands introduced in Hall and
Horowitz (2013) (see Remark 10 below for additional implementation details).


RBC: our proposed local-linear estimator with local-quadratic bias correction and $\rho = 1$ using
robust standard errors ($I_{rbc}$).

In all cases the Epanechnikov kernel is used. The bandwidth h is chosen in three different 
ways: 

(i) population MSE-optimal choice $h_{mse}$;

(ii) estimated ROT optimal coverage error rate $h_{rot}$;

(iii) estimated DPI optimal coverage error rate $h_{dpi}$.

For the construction of the variance estimators $\hat\sigma^2_{us}$ and $\hat\sigma^2_{rbc}$ we consider HC3 plug-in residuals
when forming the $\hat\Sigma$ matrix. In Table S.II.9 we report empirical coverage and average interval
length of RBC 95% confidence intervals (only for Model 5) using $h_{mse}$ for different variance
estimators. The results reflect the robustness of the findings to this choice.
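To make the ingredients above concrete, here is a minimal Python sketch of a local-linear fit at a point with the Epanechnikov kernel and an HC3-type sandwich standard error, giving a conventional (undersmoothing-style) interval. It is an illustration under stated assumptions, not the nprobust implementation: the function names, the normal critical value `z = 1.96`, and the use of weighted least-squares leverages for the HC3 adjustment are choices made here for the example.

```python
import numpy as np


def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 (1 - u^2) on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)


def local_linear_ci(x, y, x0, h, z=1.96):
    """Local-linear estimate of m(x0) with an HC3-type standard error.

    Returns (estimate, standard error, lower bound, upper bound).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    u = (x - x0) / h
    w = epanechnikov(u)
    keep = w > 0                       # only observations inside the window
    u, w, yk = u[keep], w[keep], y[keep]
    R = np.column_stack([np.ones_like(u), u])       # rows (1, (X_i - x0)/h)
    G_inv = np.linalg.inv(R.T @ (w[:, None] * R))   # (R'WR)^{-1}
    beta = G_inv @ (R.T @ (w * yk))                 # weighted least squares
    resid = yk - R @ beta
    # Leverage of each point in the weighted fit: w_i r_i' (R'WR)^{-1} r_i.
    lev = w * np.einsum("ij,jk,ik->i", R, G_inv, R)
    e_hc3 = resid / (1.0 - lev)                     # HC3-adjusted residuals
    meat = R.T @ (((w * e_hc3) ** 2)[:, None] * R)  # R'W diag(e^2) W R
    V = G_inv @ meat @ G_inv                        # sandwich variance
    est, se = beta[0], np.sqrt(V[0, 0])
    return est, se, est - z * se, est + z * se
```

The intercept of the weighted fit is the point estimate of $m(x_0)$; the interval is centered there without any bias correction, so its coverage depends on the bandwidth being small enough.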

The results are presented in detail in the tables and figures below to give a complete
picture of the performance of robust bias correction. First, Tables S.II.1-S.II.6 show, for
each regression model, respectively, the performance of the five methods above, in terms of
empirical coverage and interval length, for all evaluation points and bandwidth choices (recall
that $I_{us}$ and $I_{bc}$ have the same length). Panel A of each table shows the coverage and length,
while Panel B gives summary statistics for the two fully data-driven bandwidths. Note that
in some cases the population MSE-optimal bandwidth is not defined or is not computable
numerically, usually because the bias is too small or other values are too extreme.
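The two summary measures reported in Panel A can be computed from the simulated intervals as in the short sketch below. The function name and argument layout are hypothetical; it only assumes one interval per replication and a known true value $m(x)$.

```python
import numpy as np


def coverage_and_length(lower, upper, truth):
    """Empirical coverage (in percent) and average interval length.

    lower, upper: interval endpoints, one pair per Monte Carlo replication.
    truth: the true value m(x) at the evaluation point.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    covered = (lower <= truth) & (truth <= upper)
    return 100.0 * covered.mean(), (upper - lower).mean()
```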

The broad conclusion from these tables is that robust bias correction provides excellent 
coverage and that the data-driven bandwidths perform well and are numerically stable. In 
almost all cases robust bias correction provides correct coverage, whereas the other methods 
often, but not always, fail to do so. In cases where there is little to no bias all the methods 
give good coverage. This can be seen in results for Models 2 and 4, at |x| = 2/3, far enough 
away from the “hump” in the center of each, where the true regression function is (nearly) 
linear. But despite the encouraging results away from the center, only robust bias correction 
yields good coverage closer to the center (|x| = 1/3), when there is more bias. Going further, 
considering $x = 0$, the center of the sharp peak in these models, we see that even robust bias
correction fails to provide accurate coverage for $h_{rot}$, although $h_{dpi}$ performs slightly better.
At this point, for these models, the bias is too extreme even for robust bias correction to 
overcome. The results for the other models yield similar lessons. 

It is somewhat more difficult to compare interval length using these tables. The tables invite
a comparison at a fixed bandwidth, in which case, by construction, undersmoothing will have a
shorter length. However, this ignores the fact that robust bias correction can accommodate
a larger range of bandwidths, and in particular will optimally use a larger bandwidth. For
example, robust bias correction has excellent coverage in many cases for $h_{rot}$, which is in this
case a data-driven MSE-optimal choice (i.e. they coincide). This bandwidth is generally larger
than $h_{dpi}$, and hence undersmoothing generally covers better with the latter. However, if we
compare the length of $I_{rbc}(h_{rot})$ to the length of $I_{us}(h_{dpi})$, we see that robust bias correction
compares favorably in terms of length.

Both to better make this point and to illustrate the robustness of $I_{rbc}$ to tuning parameter
selection, Figures S.II.2-S.II.13 show empirical coverage and length for all six models, and all
evaluation points, across a range of bandwidths. The dotted vertical line shows the population
MSE-optimal bandwidth (whenever available) for reference. The coverage figures highlight the
delicate balance required for undersmoothing to provide correct coverage, and the generally
poor performance of traditional bias correction, but show that for a wide range of bandwidths
robust bias correction provides correct coverage. Further, interval length is not unduly inflated
for bandwidths that provide correct coverage. Again, by construction, undersmoothing will
yield shorter intervals for a fixed bandwidth, and this is clear from Figures S.II.8-S.II.13,
but it is also clear that robust bias correction can use much larger bandwidths while still
maintaining correct coverage.

To further illustrate this idea, in Tables S.II.7-S.II.8 we compare average interval length
of US and RBC 95% confidence intervals but at different bandwidths. First, in Table S.II.7
we compute average interval length at the largest bandwidth that provides close to correct
coverage for each method separately. Note that in all cases these bandwidths are not feasible:
these are ex-post findings. Next, in Table S.II.8 we evaluate the performance of US and RBC
confidence intervals at certain alternative bandwidths likely to be chosen in practice. First,
we evaluate the performance of US confidence intervals at $h = \lambda h_{mse}$ for $\lambda \in \{0.5, 0.7\}$. We
then compare the performance with RBC confidence intervals computed using the optimal,
fully data-driven choices $h_{rot}$ and $h_{dpi}$. Both tables reflect that, once we control for coverage,
interval lengths do not differ systematically between the two approaches.
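The ex-post selection behind Table S.II.7, picking the largest bandwidth whose empirical coverage is close to nominal, could be sketched as follows. The tolerance `tol` that defines "close to correct" is a hypothetical choice made for this example; the text does not state the threshold actually used.

```python
import numpy as np


def largest_bandwidth_with_coverage(bandwidths, coverage, nominal=95.0, tol=1.0):
    """Largest bandwidth whose coverage (in percent) is within tol of nominal.

    Returns None when no bandwidth on the grid qualifies. The value of tol
    is an assumption for illustration only.
    """
    bandwidths = np.asarray(bandwidths, dtype=float)
    coverage = np.asarray(coverage, dtype=float)
    ok = np.abs(coverage - nominal) <= tol
    if not ok.any():
        return None
    return float(bandwidths[ok].max())
```

Because the rule uses the simulated coverage itself, it is infeasible in practice, which is why the text describes these bandwidths as ex-post findings.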

Figures S.II.14-S.II.19 make this same point in a different way. For a range of bandwidths,
as in the previous figures, we show the “average position” of $I_{us}$ and $I_{rbc}$, where the center of
the bar is placed at the average bias and the length of each bar is the average interval length
across the simulations. The bars are then color-coded by coverage (green bars having good
coverage, fading to red showing undercoverage). These make visually clear that although
undersmoothing provides shorter intervals in general, this comes at the expense of coverage,
while robust bias correction provides good coverage for a range of bandwidths, many of which
are “large” enough to yield narrow intervals.

All our methods are implemented in software available from the authors’ websites and via
the R package nprobust available at https://cran.r-project.org/package=nprobust.

Remark 10 (Implementation of Hall and Horowitz (2013)). The column HH computes the
bootstrapped confidence bands introduced in Hall and Horowitz (2013), following as closely as
possible their implementation choices. First, we estimate $m(x)$ using a local linear estimator
with the Epanechnikov kernel for our previously discussed bandwidth choices. Standard errors
are calculated using their proposed variance estimator $\hat\sigma^2_{HH} = \kappa \hat\sigma^2 / \hat f_X(x)$, where $\kappa = \int K^2$
and $\hat f_X(x)$ is a standard kernel density estimator using a data-driven bandwidth choice $h_1$.
Then, we use the same estimator for the error variance, $\hat\sigma^2 = \sum_{i=1}^n \hat\varepsilon_i^2 / n$, with $\hat\varepsilon_i = \tilde\varepsilon_i - \bar\varepsilon$,
$\tilde\varepsilon_i = Y_i - \hat m(X_i)$, and $\bar\varepsilon = n^{-1} \sum_{i=1}^n \tilde\varepsilon_i$. Next, we generate $B = 500$ bootstrap samples
$\{(X_i, Y_i^*)\}$, $1 \le i \le n$, where $Y_i^* = \hat m(X_i) + \varepsilon_i^*$, with $\varepsilon_i^*$ obtained by sampling with replacement
from the $\{\hat\varepsilon_i\}$, $1 \le i \le n$. With these bootstrap samples we can construct the final confidence
bands using the adjusted critical values that approximate the estimated coverage error with
the selected one. Following their recommendation, the final critical values are taken to be
the $\xi$-level quantile (for $\xi = 0.1$) obtained by repeating this exercise over a grid of evaluation
points, which we choose to be the sequence $\{x_1, \dots, x_N\} = \{-0.9, -0.8, \dots, 0, \dots, 0.8, 0.9\}$. ■

Figure S.II.1: Regression Functions
(d) Model 4 (e) Model 5 (f) Model 6
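The resampling step described in Remark 10 can be sketched in Python as below. This covers only the centered-residual bootstrap and the error-variance estimate; the adjusted critical values and the density estimate are omitted. The function name and the array layout of the output are assumptions made for this example, and `m_hat` stands for fitted values $\hat m(X_i)$ obtained from any first-stage estimator.

```python
import numpy as np


def residual_bootstrap(y, m_hat, B=500, rng=None):
    """Generate B bootstrap response vectors by resampling centered residuals.

    y: observed responses Y_i; m_hat: fitted values at the X_i.
    Returns (y_star, sigma2_hat), where y_star has shape (B, n) and
    sigma2_hat is the mean of the squared centered residuals.
    """
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    m_hat = np.asarray(m_hat, dtype=float)
    raw = y - m_hat                    # uncentered residuals
    centered = raw - raw.mean()        # centered residuals
    sigma2_hat = np.mean(centered**2)  # error-variance estimate
    n = y.size
    idx = rng.integers(0, n, size=(B, n))   # sample indices with replacement
    y_star = m_hat[None, :] + centered[idx]
    return y_star, sigma2_hat
```

Each row of `y_star` is paired with the original design points, so the estimator can be recomputed on every bootstrap sample.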










Table S.II.1: Simulation Results for Model 1

Panel A: Empirical Coverage and Average Interval Length of 95% Confidence Intervals

                           Empirical Coverage (%)                 Interval Length
         Bandwidth      US  Locfit      BC      HH     RBC      US  Locfit      HH     RBC
x = -2/3
  h_mse      0.478    57.3    76.8    83.2    31.6    94.3   0.301   0.329   0.197   0.421
  h_rot      0.202    93.4    94.0    82.9    94.4    94.5   0.439   0.478   0.466   0.628
  h_dpi      0.132    94.0    94.3    81.7    96.1    93.5   0.575   0.619   0.644   0.818
x = -1/3
  h_mse      0.331     4.2    30.7    82.3     1.5    93.1   0.355   0.376   0.276   0.486
  h_rot      0.486     2.8     9.0    54.0     2.2    63.5   0.326   0.326   0.199   0.417
  h_dpi      0.284    38.4    61.9    81.6    34.4    92.8   0.385   0.413   0.343   0.538
x = 0
  h_mse      0.115    52.4    72.3    83.4    61.4    93.8   0.595   0.623   0.660   0.825
  h_rot      0.463     0.0     0.0     0.0     0.0     0.0   0.353   0.327   0.199   0.461
  h_dpi      0.182     5.5    14.8    71.7     7.2    84.6   0.502   0.500   0.504   0.660
x = 1/3
  h_mse      0.383    93.2    94.8    77.4    82.3    91.4   0.317   0.353   0.239   0.454
  h_rot      0.339    94.6    95.4    79.3    87.6    92.6   0.340   0.377   0.281   0.488
  h_dpi      0.222    93.5    94.0    78.6    93.1    92.2   0.435   0.475   0.450   0.623
x = 2/3
  h_mse      0.478    58.8    78.5    83.6    31.4    94.5   0.301   0.330   0.197   0.421
  h_rot      0.290    88.3    92.4    82.6    82.5    94.4   0.364   0.402   0.323   0.523
  h_dpi      0.193    91.6    93.0    80.1    91.8    93.2   0.466   0.507   0.502   0.669

Panel B: Summary Statistics for the Estimated Bandwidths

         Pop. Par.      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. Std. Dev.
x = -2/3
  h_rot      0.478    0.1573    0.1829    0.1909    0.2018    0.2018    0.6882     0.050
  h_dpi          -   0.02829    0.1012    0.1232    0.1318    0.1503    0.9401     0.057
x = -1/3
  h_rot      0.331    0.2201    0.3979    0.4964    0.4855    0.5719    0.7242     0.109
  h_dpi          -   0.04872    0.2539    0.2862    0.2838    0.3105     1.317     0.079
x = 0
  h_rot      0.115    0.2886    0.4344    0.4618    0.4635    0.4912    0.6602     0.045
  h_dpi          -   0.04657    0.1637     0.181    0.1817    0.1994    0.3009     0.028
x = 1/3
  h_rot      0.383    0.2103    0.2815    0.3353    0.3385    0.3889    0.5925     0.066
  h_dpi          -   0.02326    0.1666    0.2067    0.2225    0.2626     1.717     0.090
x = 2/3
  h_rot      0.478     0.212    0.2545     0.281      0.29    0.3189    0.5017     0.044
  h_dpi          -   0.02727    0.1499    0.1833    0.1934    0.2241    0.9813     0.071

Notes:

(i) US = Undersmoothing, Locfit = R package locfit by Loader (2013), BC = Bias Corrected, HH = Hall
and Horowitz (2013), RBC = Robust Bias Corrected.

(ii) The “Bandwidth” column reports the population and average estimated bandwidth choices, as appropriate,






















Table S.II.2: Simulation Results for Model 2

Panel A: Empirical Coverage and Average Interval Length of 95% Confidence Intervals

                           Empirical Coverage (%)                 Interval Length
         Bandwidth      US  Locfit      BC      HH     RBC      US  Locfit      HH     RBC
x = -2/3
  h_mse          -       -       -       -       -       -       -       -       -       -
  h_rot      0.326    95.2    95.3    82.9    86.6    94.8   0.350   0.386   0.280   0.502
  h_dpi      0.219    94.8    95.0    81.9    92.6    94.0   0.443   0.481   0.422   0.632
x = -1/3
  h_mse      0.706     0.0     0.3     1.0     0.0     3.8   0.253   0.267   0.122   0.355
  h_rot      0.459     0.8    18.7    83.2     0.2    94.2   0.303   0.326   0.189   0.417
  h_dpi      0.311    63.5    77.9    78.6    54.1    91.1   0.362   0.395   0.294   0.514
x = 0
  h_mse      0.115    52.4    72.4    83.4    49.8    93.8   0.595   0.623   0.573   0.825
  h_rot      0.495     0.0     0.0     0.0     0.0     0.0   0.341   0.315   0.174   0.450
  h_dpi      0.197     1.8     7.6    66.8     1.8    80.3   0.487   0.479   0.432   0.633
x = 1/3
  h_mse      0.706     0.0     0.3     1.1     0.0     4.8   0.254   0.267   0.122   0.355
  h_rot      0.459     0.7    18.7    84.4     0.0    94.6   0.303   0.326   0.189   0.417
  h_dpi      0.311    63.6    77.3    76.2    54.3    90.2   0.361   0.394   0.294   0.513
x = 2/3
  h_mse          -       -       -       -       -       -       -       -       -       -
  h_rot      0.323    94.9    95.1    82.4    87.9    94.5   0.351   0.388   0.283   0.505
  h_dpi      0.215    94.4    94.1    80.9    92.7    93.7   0.446   0.487   0.428   0.640

Panel B: Summary Statistics for the Estimated Bandwidths

         Pop. Par.      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. Std. Dev.
x = -2/3
  h_rot          -    0.2046    0.2619    0.2948    0.3262    0.3819    0.5828        NA
  h_dpi          -   0.01928    0.1619       0.2    0.2186    0.2577     1.307        NA
x = -1/3
  h_rot      0.706    0.3176    0.4301    0.4558    0.4589    0.4843    0.6388     0.042
  h_dpi          -    0.1275    0.2603    0.2927    0.3106    0.3373     1.718     0.090
x = 0
  h_rot      0.115    0.3844    0.4694    0.4921    0.4949    0.5171    0.6434     0.036
  h_dpi          -    0.0537      0.18    0.1957    0.1968    0.2125    0.2953     0.025
x = 1/3
  h_rot      0.706    0.2997    0.4299    0.4557     0.459    0.4846    0.6554     0.042
  h_dpi          -    0.1289    0.2602    0.2922    0.3105    0.3392     1.931     0.093
x = 2/3
  h_rot          -    0.2065    0.2596      0.29    0.3226    0.3762    0.5824     0.082
  h_dpi          -   0.02684    0.1578    0.1957    0.2146    0.2558     1.081     0.089

Notes:

(i) US = Undersmoothing, Locfit = R package locfit by Loader (2013), BC = Bias Corrected, HH = Hall
and Horowitz (2013), RBC = Robust Bias Corrected.

(ii) The “Bandwidth” column reports the population and average estimated bandwidth choices, as appropriate,























Table S.II.3: Simulation Results for Model 3

Panel A: Empirical Coverage and Average Interval Length of 95% Confidence Intervals

                           Empirical Coverage (%)                 Interval Length
         Bandwidth      US  Locfit      BC      HH     RBC      US  Locfit      HH     RBC
x = -2/3
  h_mse          -       -       -       -       -       -       -       -       -       -
  h_rot      0.532    91.2    92.1    85.7    64.3    95.2   0.299   0.313   0.166   0.405
  h_dpi      0.346    93.3    93.5    82.5    81.1    94.1   0.346   0.374   0.257   0.495
x = -1/3
  h_mse          -       -       -       -       -       -       -       -       -       -
  h_rot      0.696    80.1    87.0    82.1    43.2    94.8   0.232   0.253   0.116   0.335
  h_dpi      0.491    90.7    92.6    81.0    68.2    94.0   0.281   0.307   0.173   0.405
x = 0
  h_mse      0.976    13.5    19.7    39.8     1.3    61.9   0.198   0.214   0.082   0.283
  h_rot      0.696    34.3    63.2    84.9     7.7    95.7   0.234   0.253   0.116   0.333
  h_dpi      0.491    79.9    87.7    79.0    56.1    92.3   0.282   0.308   0.174   0.406
x = 1/3
  h_mse      0.246    77.8    86.1    79.6    67.3    92.7   0.393   0.423   0.326   0.563
  h_rot      0.695    86.6    83.2    49.7    52.7    71.8   0.237   0.253   0.116   0.343
  h_dpi      0.494    76.5    71.5    52.5    47.2    73.3   0.285   0.307   0.172   0.410
x = 2/3
  h_mse      0.246    79.0    85.7    79.7    68.2    92.8   0.393   0.424   0.327   0.564
  h_rot      0.505    78.4    75.7    46.3    47.5    69.1   0.307   0.320   0.176   0.422
  h_dpi      0.325    78.3    82.7    70.8    60.2    88.0   0.360   0.387   0.274   0.516

Panel B: Summary Statistics for the Estimated Bandwidths

         Pop. Par.      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. Std. Dev.
x = -2/3
  h_rot          -    0.2424    0.4328    0.5297    0.5317    0.6234     0.855     0.122
  h_dpi          -   0.07286     0.269    0.3288    0.3458    0.3989      1.65     0.119
x = -1/3
  h_rot          -    0.4494    0.6663    0.7015    0.6957    0.7301    0.8429     0.049
  h_dpi          -    0.2405     0.406    0.4573    0.4913    0.5358     2.668     0.140
x = 0
  h_rot      0.976    0.4981    0.6655    0.7024     0.696    0.7324    0.8301     0.051
  h_dpi          -    0.2371    0.4036    0.4592    0.4913    0.5368     2.671     0.147
x = 1/3
  h_rot      0.246    0.4885    0.6638     0.702    0.6953    0.7321    0.8256     0.052
  h_dpi          -    0.2426    0.4062    0.4596    0.4942    0.5372     2.992     0.153
x = 2/3
  h_rot      0.246     0.204    0.3964    0.4995    0.5049    0.6076    0.8263     0.131
  h_dpi          -   0.06684      0.24    0.3105    0.3252     0.381     1.611     0.118

Notes:

(i) US = Undersmoothing, Locfit = R package locfit by Loader (2013), BC = Bias Corrected, HH = Hall
and Horowitz (2013), RBC = Robust Bias Corrected.

(ii) The “Bandwidth” column reports the population and average estimated bandwidth choices, as appropriate,

























Table S.II.4: Simulation Results for Model 4

Panel A: Empirical Coverage and Average Interval Length of 95% Confidence Intervals

                           Empirical Coverage (%)                 Interval Length
         Bandwidth      US  Locfit      BC      HH     RBC      US  Locfit      HH     RBC
x = -2/3
  h_mse          -       -       -       -       -       -       -       -       -       -
  h_rot      0.310    95.1    95.3    83.2    88.3    95.0   0.357   0.392   0.294   0.512
  h_dpi      0.208    94.5    95.0    81.9    93.3    93.9   0.452   0.491   0.437   0.646
x = -1/3
  h_mse      0.466     0.3     8.1    76.8     0.0    90.3   0.300   0.322   0.184   0.412
  h_rot      0.439     0.6    14.6    82.5     0.1    94.1   0.308   0.332   0.198   0.425
  h_dpi      0.304    56.5    73.8    79.5    47.9    91.2   0.366   0.398   0.301   0.519
x = 0
  h_mse      0.127    51.8    71.9    83.4    50.8    93.9   0.564   0.592   0.556   0.784
  h_rot      0.472     0.0     0.0     0.0     0.0     0.0   0.348   0.320   0.183   0.446
  h_dpi      0.188     6.6    19.5    76.1     6.6    88.8   0.483   0.489   0.449   0.645
x = 1/3
  h_mse      0.466     0.2     8.9    75.5     0.0    89.7   0.300   0.321   0.184   0.411
  h_rot      0.439     0.4    14.5    82.9     0.0    94.1   0.308   0.331   0.197   0.425
  h_dpi      0.304    57.3    73.3    77.0    48.8    90.5   0.366   0.397   0.301   0.519
x = 2/3
  h_mse          -       -       -       -       -       -       -       -       -       -
  h_rot      0.307    94.8    94.9    82.2    88.7    94.5   0.358   0.395   0.296   0.516
  h_dpi      0.204    94.0    94.2    81.3    92.6    93.9   0.458   0.498   0.444   0.657

Panel B: Summary Statistics for the Estimated Bandwidths

         Pop. Par.      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. Std. Dev.
x = -2/3
  h_rot          -    0.2014    0.2547    0.2807    0.3101    0.3445    0.5721     0.078
  h_dpi          -   0.02144    0.1547    0.1907    0.2076    0.2423     1.878     0.090
x = -1/3
  h_rot      0.466    0.3059    0.4129    0.4367     0.439    0.4624    0.5956     0.038
  h_dpi          -    0.1261     0.253    0.2857    0.3038    0.3325     1.192     0.087
x = 0
  h_rot      0.127    0.3694    0.4497    0.4703    0.4722    0.4923    0.6034     0.032
  h_dpi          -   0.07306    0.1725    0.1869     0.188    0.2022     0.317     0.023
x = 1/3
  h_rot      0.466    0.2918     0.413    0.4367     0.439    0.4626    0.6154     0.038
  h_dpi          -    0.1294    0.2527    0.2853    0.3043    0.3309     1.856     0.094
x = 2/3
  h_rot          -    0.2032    0.2527    0.2776    0.3069      0.34    0.5755     0.076
  h_dpi          -   0.02984    0.1513    0.1869    0.2041    0.2418      1.03     0.084

Notes:

(i) US = Undersmoothing, Locfit = R package locfit by Loader (2013), BC = Bias Corrected, HH = Hall
and Horowitz (2013), RBC = Robust Bias Corrected.

(ii) The “Bandwidth” column reports the population and average estimated bandwidth choices, as appropriate,






















Table S.II.5: Simulation Results for Model 5

Panel A: Empirical Coverage and Average Interval Length of 95% Confidence Intervals

                           Empirical Coverage (%)                 Interval Length
         Bandwidth      US  Locfit      BC      HH     RBC      US  Locfit      HH     RBC
x = -2/3
  h_mse          -       -       -       -       -       -       -       -       -       -
  h_rot      0.251    95.1    94.7    82.9    90.8    94.8   0.390   0.423   0.338   0.560
  h_dpi      0.166    94.8    94.4    81.8    93.5    93.7   0.505   0.544   0.479   0.722
x = -1/3
  h_mse      0.307    43.0    69.0    84.0    25.3    94.5   0.354   0.379   0.271   0.502
  h_rot      0.405     9.7    27.7    81.9     4.9    93.4   0.316   0.333   0.209   0.439
  h_dpi      0.283    56.5    70.7    80.6    48.2    92.8   0.380   0.409   0.316   0.540
x = 0
  h_mse          -       -       -       -       -       -       -       -       -       -
  h_rot      0.475    24.8    50.4    79.4     5.5    92.8   0.286   0.308   0.176   0.409
  h_dpi      0.318    74.4    83.7    80.3    61.1    92.6   0.354   0.383   0.279   0.507
x = 1/3
  h_mse      0.821     3.3    37.3    81.2     0.1    93.5   0.226   0.240   0.102   0.318
  h_rot      0.536    72.1    88.1    77.3    43.9    92.1   0.268   0.292   0.158   0.384
  h_dpi      0.370    89.9    92.1    78.5    78.4    92.9   0.327   0.356   0.241   0.470
x = 2/3
  h_mse      0.886    91.0    94.2    74.8    46.0    79.9   0.288   0.312   0.107   0.315
  h_rot      0.400    93.5    93.9    82.7    79.5    94.4   0.318   0.341   0.218   0.453
  h_dpi      0.265    93.9    93.9    81.3    88.4    93.6   0.391   0.425   0.339   0.562

Panel B: Summary Statistics for the Estimated Bandwidths

         Pop. Par.      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. Std. Dev.
x = -2/3
  h_rot          -    0.1888    0.2253    0.2396    0.2513     0.262    0.5376     0.044
  h_dpi          -   0.02597    0.1302    0.1563    0.1665    0.1929    0.8877     0.065
x = -1/3
  h_rot      0.307     0.245    0.3677    0.4093    0.4045    0.4421    0.5661     0.053
  h_dpi          -   0.04247    0.2272    0.2666    0.2825     0.317     1.897     0.093
x = 0
  h_rot          -     0.362    0.4439    0.4707    0.4747    0.5004    0.6879     0.044
  h_dpi          -     0.157     0.259    0.2983     0.318    0.3538      1.57     0.096
x = 1/3
  h_rot      0.821    0.2897     0.479    0.5284    0.5364    0.5959    0.7665     0.079
  h_dpi          -    0.1258    0.2967    0.3465    0.3699    0.4137     1.508     0.115
x = 2/3
  h_rot      0.886    0.2545    0.3442     0.376    0.3998    0.4205    0.7831     0.086
  h_dpi          -   0.06169    0.2045    0.2436    0.2651    0.3033    0.8067     0.090

Notes:

(i) US = Undersmoothing, Locfit = R package locfit by Loader (2013), BC = Bias Corrected, HH = Hall
and Horowitz (2013), RBC = Robust Bias Corrected.

(ii) The “Bandwidth” column reports the population and average estimated bandwidth choices, as appropriate,























Table S.II.6: Simulations Results for Model 6

Panel A: Empirical Coverage and Average Interval Length of 95% Confidence Intervals

                               Empirical Coverage                  Interval Length
          Bandwidth      US  Locfit      BC      HH     RBC      US  Locfit      HH     RBC
x = -2/3
  h_mse       0.782    88.9    89.4    90.6    45.7    94.6   0.288   0.298   0.113   0.332
  h_rot       0.565    90.5    91.4    85.7    60.8    94.7   0.294   0.302   0.152   0.391
  h_dpi       0.371    93.1    93.2    81.7    77.4    93.9   0.333   0.358   0.233   0.475
x = -1/3
  h_mse       0.975    80.1    83.2    77.2    34.3    91.1   0.210   0.217   0.084   0.295
  h_rot       0.578    91.5    93.4    83.6    64.3    95.2   0.254   0.276   0.139   0.366
  h_dpi       0.411    93.8    94.0    82.4    78.2    94.0   0.309   0.336   0.207   0.445
x = 0
  h_mse           -       -       -       -       -       -       -       -       -       -
  h_rot       0.562    87.0    91.1    81.6    60.4    94.6   0.258   0.280   0.142   0.372
  h_dpi       0.401    90.9    92.2    80.5    74.6    93.3   0.312   0.340   0.212   0.450
x = 1/3
  h_mse       0.616    51.9    73.4    81.8    18.5    94.7   0.246   0.266   0.129   0.353
  h_rot       0.546    66.6    78.9    80.9    36.5    93.7   0.262   0.284   0.147   0.376
  h_dpi       0.389    83.3    87.3    79.6    66.7    93.3   0.318   0.345   0.219   0.458
x = 2/3
  h_mse           -       -       -       -       -       -       -       -       -       -
  h_rot       0.462    94.9    94.5    83.3    75.0    94.4   0.303   0.317   0.180   0.427
  h_dpi       0.307    94.4    94.2    81.3    84.8    93.7   0.362   0.392   0.279   0.520


Panel B: Summary Statistics for the Estimated Bandwidths

         Pop. Par.      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. Std. Dev.
x = -2/3
  h_rot      0.782    0.2668    0.5164    0.5764    0.5647    0.6229    0.7842     0.084
  h_dpi          -    0.1125    0.2998    0.3481    0.3707    0.4131     2.057     0.121
x = -1/3
  h_rot      0.975    0.4066    0.5321    0.5717    0.5778    0.6184    0.7935     0.063
  h_dpi          -    0.1991    0.3296     0.382    0.4106    0.4524     2.029     0.129
x = 0
  h_rot          -    0.4028    0.5188    0.5565    0.5622    0.5992    0.7875     0.059
  h_dpi          -    0.1903    0.3237    0.3729    0.4009     0.445     2.431     0.125
x = 1/3
  h_rot      0.616    0.3523    0.5043    0.5402     0.546     0.582    0.7959     0.058
  h_dpi          -     0.166    0.3113    0.3609    0.3892    0.4289     2.203     0.131
x = 2/3
  h_rot          -     0.262    0.4121    0.4535    0.4618    0.5018    0.8093     0.069
  h_dpi          -    0.1082    0.2444    0.2868    0.3071    0.3461     1.515     0.100


Notes:

(i) US = Undersmoothing, Locfit = R package locfit by Loader (2013), BC = Bias Corrected, HH = Hall
and Horowitz (2013), RBC = Robust Bias Corrected.

(ii) The “Bandwidth” column reports the population and average estimated bandwidth choices, as appropriate,
for bandwidth h_n.
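
The two summaries reported in Panel A, empirical coverage and average interval length, can be computed from Monte Carlo output roughly as follows. This is only a minimal sketch under stated assumptions: the function name `coverage_and_length`, its arguments, and the four toy intervals are illustrative and are not taken from the simulation code behind these tables.

```python
def coverage_and_length(lowers, uppers, truth):
    """Summarize simulated confidence intervals for one evaluation point.

    lowers, uppers: interval endpoints, one pair per Monte Carlo replication.
    truth: the true value of the target at that point (known in a simulation).
    Returns (empirical coverage in percent, average interval length).
    """
    if len(lowers) != len(uppers) or len(lowers) == 0:
        raise ValueError("need one lower and one upper endpoint per replication")
    pairs = list(zip(lowers, uppers))
    # Count the replications whose interval contains the true value.
    covered = sum(1 for lo, hi in pairs if lo <= truth <= hi)
    total_length = sum(hi - lo for lo, hi in pairs)
    return 100.0 * covered / len(pairs), total_length / len(pairs)


# Toy input (hypothetical numbers, not taken from the tables).
ec, il = coverage_and_length(
    lowers=[0.90, 1.10, 0.80, 0.95],
    uppers=[1.20, 1.40, 1.10, 1.30],
    truth=1.0,
)
print(f"EC = {ec:.1f}%, IL = {il:.4f}")
```

Applying such a function once per method, bandwidth choice, and evaluation point x would fill one coverage cell and one length cell of a table laid out like Panel A.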
























Table S.II.7: Empirical Coverage and Average Interval Length of 95% Confidence Intervals

                     US                      RBC
                 h      EC      IL       h      EC      IL
Model 1
  x = -2/3   0.140    94.8   0.523   0.420    94.8   0.442
  x = -1/3   0.100    94.7   0.625   0.420    94.8   0.434
  x = 0      0.100    71.3   0.640   0.100    93.7   0.893
  x = 1/3    0.300    94.6   0.355   0.440    94.3   0.425
  x = 2/3    0.100    95.0   0.624   0.260    94.9   0.546
Model 2
  x = -2/3   0.180    94.9   0.459   0.540    94.9   0.399
  x = -1/3   0.140    94.8   0.524   0.440    94.9   0.424
  x = 0      0.100    71.3   0.640   0.100    93.7   0.893
  x = 1/3    0.140    94.5   0.522   0.440    94.2   0.424
  x = 2/3    0.260    94.9   0.380   0.280    94.9   0.525
Model 3
  x = -2/3   0.140    94.9   0.523   0.420    94.9   0.442
  x = -1/3   0.200    94.9   0.435   0.400    94.9   0.440
  x = 0      0.100    94.7   0.628   0.680    94.7   0.337
  x = 1/3    0.100    93.9   0.623   0.100    94.0   0.887
  x = 2/3    0.100    94.6   0.624   0.180    94.9   0.658
Model 4
  x = -2/3   0.180    94.9   0.459   0.520    94.8   0.406
  x = -1/3   0.100    94.8   0.625   0.400    94.8   0.444
  x = 0      0.100    79.3   0.636   0.100    93.9   0.893
  x = 1/3    0.100    94.4   0.623   0.400    94.2   0.443
  x = 2/3    0.320    94.9   0.342   0.280    94.9   0.525
Model 5
  x = -2/3   0.180    94.9   0.459   0.200    94.8   0.624
  x = -1/3   0.100    94.7   0.625   0.180    94.6   0.658
  x = 0      0.100    94.6   0.628   0.240    94.4   0.572
  x = 1/3    0.140    94.6   0.522   0.260    94.3   0.545
  x = 2/3    0.200    94.8   0.434   0.280    94.9   0.525
Model 6
  x = -2/3   0.140    94.9   0.523   0.600    94.9   0.379
  x = -1/3   0.140    94.8   0.524   0.420    94.9   0.429
  x = 0      0.100    94.8   0.628   0.600    94.9   0.359
  x = 1/3    0.140    94.5   0.522   0.480    94.4   0.401
  x = 2/3    0.260    94.8   0.380   0.420    94.9   0.442


Notes: Bandwidths are selected ex post as the largest bandwidths yielding good coverage, and as such cannot be
made feasible.
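
The ex post selection rule described in the note can be illustrated with a short sketch. The note does not define "good coverage", so the tolerance band around the nominal level used here (`nominal`, `tol`), the function name `largest_good_bandwidth`, and the toy grid are all assumptions made for illustration; the behavior when no bandwidth qualifies is likewise not specified in the source, and this sketch simply returns `None`.

```python
def largest_good_bandwidth(coverage_by_h, nominal=95.0, tol=0.5):
    """Return the largest bandwidth whose coverage is within tol of nominal.

    coverage_by_h: dict mapping bandwidth -> empirical coverage in percent.
    nominal, tol: a hypothetical definition of "good coverage" (an assumption).
    Returns None when no bandwidth on the grid qualifies.
    """
    good = [h for h, ec in coverage_by_h.items() if abs(ec - nominal) <= tol]
    return max(good) if good else None


# Toy coverage curve over a bandwidth grid (hypothetical numbers).
grid = {0.10: 94.7, 0.12: 94.9, 0.14: 94.8, 0.16: 93.1, 0.18: 90.2}
h_star = largest_good_bandwidth(grid)
print(h_star)
```

Because the rule needs the true coverage at every bandwidth, it can only be applied inside a simulation, which is why the note calls it infeasible.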
Table S.II.8: Empirical Coverage and Average Interval Length of 95% Confidence Intervals



              US (λ = 0.5)    US (λ = 0.7)     RBC (h_rot)     RBC (h_dpi)
                EC      IL      EC      IL      EC      IL      EC      IL
Model 1
  x = -2/3    94.4   0.630    94.7   0.528    94.3   0.630    93.5   0.818
  x = -1/3    56.5   0.410    21.1   0.362    63.3   0.417    92.8   0.538
  x = 0        0.0   0.466     0.0   0.414     0.0   0.463    84.6   0.660
  x = 1/3     93.5   0.479    94.1   0.404    92.4   0.486    92.2   0.623
  x = 2/3     95.0   0.519    93.3   0.436    94.9   0.522    93.2   0.669
Model 2
  x = -2/3    94.9   0.495    95.2   0.416    95.1   0.503    94.0   0.632
  x = -1/3    92.7   0.408    57.9   0.350    94.4   0.417    91.1   0.514
  x = 0        0.0   0.455     0.0   0.403     0.0   0.451    80.3   0.633
  x = 1/3     92.4   0.407    58.0   0.350    93.9   0.417    90.2   0.513
  x = 2/3     95.3   0.496    95.0   0.417    94.9   0.503    93.7   0.640
Model 3
  x = -2/3    94.4   0.384    93.9   0.329    94.9   0.405    94.1   0.495
  x = -1/3    93.9   0.328    91.4   0.277    94.1   0.336    94.0   0.405
  x = 0       94.5   0.329    87.5   0.277    95.8   0.334    92.3   0.406
  x = 1/3     71.2   0.331    77.5   0.281    73.0   0.343    73.3   0.410
  x = 2/3     81.4   0.399    74.7   0.343    68.9   0.423    88.0   0.516
Model 4
  x = -2/3    94.9   0.507    95.1   0.426    95.0   0.513    93.9   0.646
  x = -1/3    90.2   0.418    51.8   0.358    93.9   0.425    91.2   0.519
  x = 0        0.0   0.451     0.0   0.403     0.0   0.448    88.8   0.645
  x = 1/3     90.3   0.417    52.3   0.357    93.5   0.424    90.5   0.519
  x = 2/3     95.4   0.508    95.0   0.427    94.9   0.514    93.9   0.657
Model 5
  x = -2/3    94.6   0.560    95.0   0.470    94.4   0.562    93.7   0.722
  x = -1/3    85.1   0.437    55.0   0.370    93.1   0.440    92.8   0.540
  x = 0       90.8   0.402    73.5   0.340    92.0   0.410    92.6   0.507
  x = 1/3     94.4   0.378    94.1   0.319    92.2   0.385    92.9   0.470
  x = 2/3     95.2   0.442    94.7   0.373    95.0   0.454    93.6   0.562
Model 6
  x = -2/3    94.3   0.368    93.2   0.317    94.9   0.392    93.9   0.475
  x = -1/3    94.9   0.362    94.4   0.305    94.5   0.366    94.0   0.445
  x = 0       94.1   0.367    93.0   0.309    94.9   0.372    93.3   0.450
  x = 1/3     92.6   0.372    86.8   0.313    93.6   0.377    93.3   0.458
  x = 2/3     94.8   0.407    94.5   0.344    94.7   0.427    93.7   0.520

Notes: Undersmoothing is implemented using bandwidths h = λ h_mse for λ ∈ {0.5, 0.7}, in the columns labeled
as such.


Table S.II.9: Empirical Coverage and Average Interval Length of RBC 95% Confidence Intervals for Model 5, for Different Variance Estimators



               h      EC      IL
x = -2/3
  HC0      0.248    94.2   0.561
  HC1      0.248    94.5   0.556
  HC2      0.249    94.5   0.562
  HC3      0.249    94.5   0.559
  NN       0.251    94.8   0.560
x = -1/3
  HC0      0.400    92.3   0.440
  HC1      0.403    91.9   0.436
  HC2      0.404    92.0   0.439
  HC3      0.404    91.9   0.437
  NN       0.405    93.4   0.439
x = 0
  HC0      0.472    92.2   0.403
  HC1      0.471    92.2   0.407
  HC2      0.472    92.6   0.409
  HC3      0.472    92.4   0.408
  NN       0.475    92.8   0.409
x = 1/3
  HC0      0.543    90.6   0.378
  HC1      0.535    91.0   0.382
  HC2      0.536    91.0   0.384
  HC3      0.536    91.0   0.383
  NN       0.536    92.1   0.384
x = 2/3
  HC0      0.403    93.6   0.451
  HC1      0.399    93.8   0.449
  HC2      0.400    94.1   0.452
  HC3      0.400    94.0   0.451
  NN       0.400    94.4   0.453


Notes:

(i) The h column reports the average estimated bandwidths h_rot.
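
For orientation, the HC0 to HC3 labels refer to different rescalings of squared residuals inside a heteroskedasticity-robust variance estimator. The sketch below shows the textbook least-squares versions of those rescalings only; it is an illustration under assumptions, not the weighted local polynomial implementation used for this table, and the function name `hc_squared_residuals` and its arguments are hypothetical. The NN estimator is not covered here.

```python
def hc_squared_residuals(residuals, leverages, n_params, kind="HC0"):
    """Rescaled squared residuals for HC0-HC3 (textbook least-squares forms).

    residuals: fitted residuals e_i.
    leverages: diagonal entries h_ii of the projection ("hat") matrix.
    n_params: number of fitted coefficients k (used by HC1 only).
    """
    n = len(residuals)
    if len(leverages) != n:
        raise ValueError("residuals and leverages must have the same length")
    if kind == "HC0":  # e_i^2, no finite-sample adjustment
        return [e * e for e in residuals]
    if kind == "HC1":  # degrees-of-freedom correction n / (n - k)
        if n <= n_params:
            raise ValueError("HC1 requires more observations than parameters")
        scale = n / (n - n_params)
        return [scale * e * e for e in residuals]
    if kind == "HC2":  # e_i^2 / (1 - h_ii)
        return [e * e / (1.0 - h) for e, h in zip(residuals, leverages)]
    if kind == "HC3":  # e_i^2 / (1 - h_ii)^2
        return [e * e / (1.0 - h) ** 2 for e, h in zip(residuals, leverages)]
    raise ValueError(f"unknown variance estimator: {kind}")


# Toy residuals and leverages (hypothetical numbers).
for kind in ("HC0", "HC1", "HC2", "HC3"):
    print(kind, hc_squared_residuals([1.0, -2.0], [0.5, 0.5], 1, kind))
```

The rescaled squared residuals would then replace the raw ones in the middle of the sandwich variance formula; the adjustments matter most where leverage is high.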
Figure S.II.2: Empirical Coverage of 95% Confidence Intervals - Model 1
Figure S.II.3: Empirical Coverage of 95% Confidence Intervals - Model 2
Figure S.II.4: Empirical Coverage of 95% Confidence Intervals - Model 3
Figure S.II.5: Empirical Coverage of 95% Confidence Intervals - Model 4
Figure S.II.6: Empirical Coverage of 95% Confidence Intervals - Model 5
Figure S.II.7: Empirical Coverage of 95% Confidence Intervals - Model 6
Figure S.II.8: Average Interval Length of 95% Confidence Intervals - Model 1
Figure S.II.9: Average Interval Length of 95% Confidence Intervals - Model 2
Figure S.II.10: Average Interval Length of 95% Confidence Intervals - Model 3
Figure S.II.11: Average Interval Length of 95% Confidence Intervals - Model 4
Figure S.II.12: Average Interval Length of 95% Confidence Intervals - Model 5
Figure S.II.13: Average Interval Length of 95% Confidence Intervals - Model 6
Figure S.II.14: Empirical Coverage and Average Interval Length of 95% Confidence Intervals - Model 1



Figure S.II.15: Empirical Coverage and Average Interval Length of 95% Confidence Intervals - Model 2
Figure S.II.16: Empirical Coverage and Average Interval Length of 95% Confidence Intervals - Model 3
Figure S.II.17: Empirical Coverage and Average Interval Length of 95% Confidence Intervals - Model 4
Figure S.II.18: Empirical Coverage and Average Interval Length of 95% Confidence Intervals - Model 5
Figure S.II.19: Empirical Coverage and Average Interval Length of 95% Confidence Intervals - Model 6
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